While hidden class models of various types arise in many statistical
applications, it is often difficult to establish the identifiability of
their parameters. Focusing on models in which there is some struc- 
ture of independence of some of the observed variables conditioned 
on hidden ones, we demonstrate a general approach for establishing 
identifiability utilizing algebraic arguments. A theorem of J. Kruskal 
for a simple latent-class model with finite state space lies at the core 
of our results, though we apply it to a diverse set of models. These 
include mixtures of both finite and nonparametric product distribu- 
tions, hidden Markov models and random graph mixture models, and 
lead to a number of new results and improvements to old ones. 

In the parametric setting, this approach indicates that for such 
models, the classical definition of identifiability is typically too strong. 
Instead generic identifiability holds, which implies that the set of non- 
identifiable parameters has measure zero, so that parameter inference 
is still meaningful. In particular, this sheds light on the properties 
of finite mixtures of Bernoulli products, which have been used for 
decades despite being known to have nonidentifiable parameters. In 
the nonparametric setting, we again obtain identifiability only when 
certain restrictions are placed on the distributions that are mixed, 
but we explicitly describe the conditions. 

1. Introduction. Statistical models incorporating latent variables are widely 
used to model heterogeneity within datasets, via a hidden structure.
However, the fundamental theoretical question of the identifiability of
parameters of such models can be difficult to address. For specific models it is
even known that certain parameter values lead to nonidentifiability, while 
empirically, the model appears to be well behaved for most values. Thus 
parameter inference procedures may still be performed, even though theo- 
retical justification of their consistency is still lacking. In some cases (e.g., 
hidden Markov models [39]), it has been formally established that generic 
choices of parameters are identifiable, which means that only a subset of 
parameters of measure zero may not be identifiable. 

In this work, we consider a number of such latent variable models, all of which
exhibit a conditional independence structure, in which (some of) the ob- 
served variables are independent when conditioned on the unobserved ones. 
In particular, we investigate: 

1. finite mixtures of products of finite measures, where the mixing param- 
eters are unknown (including finite mixtures of multivariate Bernoulli 
distributions), also called latent-class models in the literature; 

2. finite mixtures of products of nonparametric measures, again with un- 
known mixing parameters; 

3. discrete hidden Markov models; 

4. a random graph mixture model, in which the probability of the presence 
of an edge is determined by the hidden states of the vertices it joins. 

We show how a fundamental algebraic result of Kruskal [29, 30] on 3-way
tables can be used to derive identifiability results for all of these models. While
Kruskal's work is focused on only 3 variates, each with finite state spaces, 
we use it to obtain new identifiability results for mixtures with more vari- 
ates (point 1, above), whether discrete or continuous (point 2). For hidden 
Markov models (point 3), with their more elaborate dependency structure, 
Kruskal's work allows us to easily recover some known results on identifiabil- 
ity that were originally approached with other tools, and to strengthen them 
in certain aspects. For the random graph mixture model (point 4), in which 
the presence/absence of each edge is independent conditioned on the states 
of all vertices, we obtain new identifiability results via this method, again 
by focusing on the model's essential conditional independence structure. 

While we establish the validity of many identifiability statements not 
previously known, the major contribution of this paper lies as much in the 
method of analysis it introduces. By relating a diverse collection of models 
to Kruskal's work, we indicate the applicability of this method of estab- 
lishing identifiability to a variety of models with appropriate conditional 
independence structure. Although our example of applying Kruskal's work 
to a complicated model such as the random graph requires substantial ad- 
ditional algebraic arguments tied to the details of the model, it illustrates 
well that the essential insight can be a valuable one.

Finally, we note that in establishing identifiability of the parameters of
a model, this method clearly indicates one must allow for the possibility of 
certain "exceptional" choices of parameter values which are not identifiable. 
However, as these exceptional values can be characterized through algebraic 
conditions, one may deduce that they are of measure zero within the param- 
eter space (in the finite-dimensional case). Since "generic" parameters are 
identifiable, one is unlikely to face identifiability problems in performing in- 
ference. Thus generic identifiability of the parameters of a model is generally 
sufficient for data analysis purposes. Although the notion of identifiability of 
parameters off a set of measure zero is not a new one, neither the usefulness 
of this notion nor its algebraic origins seem to have been widely recognized. 

2. Background. Latent structure models form a very large class of mod- 
els including, for instance, finite univariate or multivariate mixtures [34], 
hidden Markov models [5, 16] and nonparametric mixtures [33]. 

General formulations of the identification problem were made by several 
authors, and pioneering works may be found in [27, 28]. The study of iden- 
tifiability proceeds from a hypothetical exact knowledge of the distribution 
of observed variables and asks whether one may, in principle, recover the 
parameters. Thus identification problems are not problems of statistical in- 
ference in a strict sense. However, since nonidentifiable parameters cannot 
be consistently estimated, identifiability is a prerequisite of statistical pa- 
rameter inference. 

In the following, we are interested in models defined by a family
$\mathcal{M}(\Theta) = \{\mathbb{P}_\theta, \theta \in \Theta\}$ of probability distributions on some space $\Omega$, with parameter
space $\Theta$ (not necessarily finite dimensional). The classical definition of
identifiability, which we will refer to as strict identifiability, requires that for any
two different values $\theta \neq \theta'$ in $\Theta$, the corresponding probability distributions
$\mathbb{P}_\theta$ and $\mathbb{P}_{\theta'}$ are different. This is equivalent to injectivity of the model's
parameterization map $\Psi$, which takes values in $\mathcal{M}_1(\Omega)$, the set of probability
measures on $\Omega$, and is defined by $\Psi(\theta) = \mathbb{P}_\theta$.

In many cases, the above map will not be strictly injective. For instance,
it is well known that in models with discrete hidden variables (such as fi- 
nite mixtures or discrete hidden Markov models), the latent classes can be 
freely relabeled without changing the distribution of the observations, a phe- 
nomenon known as "label swapping." In this sense, the above map is always 
at least r!-to-one, where r is the number of classes in the model. However, 
this does not prevent the statistician from inferring the parameters of these 
models. Indeed, parameter identifiability up to a permutation on the class 
labels (which we henceforth consider as a type of strict identifiability), is 
largely enough for practical use, at least in a maximum likelihood setting. 
Note that the label swapping issue may cause major problems in a Bayesian 
framework (see, for instance, [34], Section 4.9).

A related concept of local identifiability only requires the parameter to
be unique in small neighborhoods in the parameter space. For parametric 
models (i.e., when the parameter space is finite dimensional), with some 
regularity conditions, there is an equivalence between local identifiability of 
the parameters and nonsingularity of the information matrix [40]. When an
iterative procedure is used to approximate an estimator of the parameter, 
different initializations can help to detect multiple solutions of the estima- 
tion problem. This often corresponds to the existence of multiple parameter 
values giving rise to the same distribution. However, the validity of such 
procedures relies on knowing that the parameterization map is, at most, 
finite-to-one, and a precise characterization of the value of k such that it is 
a $k$-to-one map would be most useful.

Thus knowledge that the parameterization map is finite-to-one might be 
too weak a result from a statistical perspective on identifiability. Moreover, 
we argue in the following that infinite-to-one maps might not be problematic, 
as long as they are generic $k$-to-one maps for known finite $k$.

While all our results are proved relying on the same underlying tool, they 
must be expressed differently in the parametric framework (including the 
finite case) and in the nonparametric one. 

The parametric framework. While the focus on one-to-one or $k$-to-one
parameterization maps is well suited for most of the classical models en- 
countered in the literature, it is inadequate in some important cases. For 
instance, it is well known that finite mixtures of Bernoulli products are not 
identifiable [23], even up to a relabeling of latent classes. However, these 
distributions are widely used to model data when many binary variables 
are observed from individuals belonging to different unknown populations, 
and parameter estimation procedures are performed in this context. For in- 
stance, these models may be used in numerical identification of bacteria 
(see [23] and the references therein). Statisticians are aware of this apparent
contradiction; the title of the article [6], practical identifiability of finite mix- 
tures of multivariate Bernoulli distributions, indicates the need to reconcile 
nonidentifiability and validity of inference procedures, and clearly indicates 
that the strict notion of identifiability is not useful in this specific context. 
We establish that parameters of finite mixtures of multivariate Bernoulli 
distributions (with a fixed number of components) are in fact generically 
identifiable (see Section 5). 

Here, "generic" is used in the sense of algebraic geometry, as will be 
defined in the subsection on algebraic terminology below. Most importantly, 
it implies that the set of points for which identifiability does not hold has 
measure zero. In this sense, any observed data set has probability one of 
being drawn from a distribution with identifiable parameters.

Understanding when generic identifiability holds, even in the case of finite
measures, can be mathematically difficult. There are well-known examples of 
latent-class models in which the parameterization map is in fact infinite-to- 
one, for reasons that are not immediately obvious. For instance, Goodman 
[22] describes a 3-class model with four manifest binary variables and thus 
a parameter space of dimension $3(4) + 2 = 14$. Though the distributions
resulting from this model lie in a space of dimension $2^4 - 1 = 15$, the image of
the parameterization map has dimension only 13. From a statistical point 
of view, this results in nonidentifiability. 
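
The dimension count in this example can be checked numerically. The sketch below is an added illustration, not part of the original analysis: it evaluates the Jacobian of the parameterization map of an $r$-class model with $p$ binary variables at a randomly chosen parameter and reports its numerical rank, which is the local dimension of the image. The function name `jacobian_rank`, the random seed and the tolerance are assumptions made for this example.

```python
import itertools
import numpy as np

def jacobian_rank(r, p, seed=0, tol=1e-9):
    """Return (number of free parameters, numerical Jacobian rank) for the
    r-class model with p binary observed variables, at a random parameter."""
    rng = np.random.default_rng(seed)
    pi = rng.dirichlet(np.ones(r))              # class proportions
    q = rng.uniform(0.1, 0.9, size=(r, p))      # q[i, j] = P(X_j = 1 | Z = i)
    states = list(itertools.product([0, 1], repeat=p))

    def factor(i, j, x):
        # conditional probability of state x for variable j in class i
        return q[i, j] if x == 1 else 1.0 - q[i, j]

    def class_prob(i, x, skip=None):
        # product over variables, optionally leaving one variable out
        return np.prod([factor(i, j, x[j]) for j in range(p) if j != skip])

    cols = []
    # derivatives with respect to pi_1..pi_{r-1}; pi_r = 1 - (sum of the others)
    for i in range(r - 1):
        cols.append([class_prob(i, x) - class_prob(r - 1, x) for x in states])
    # derivatives with respect to each q[i, j]
    for i in range(r):
        for j in range(p):
            cols.append([pi[i] * (1.0 if x[j] == 1 else -1.0) * class_prob(i, x, skip=j)
                         for x in states])
    J = np.array(cols).T                        # shape (2**p, number of parameters)
    return J.shape[1], int(np.linalg.matrix_rank(J, tol=tol))
```

Calling `jacobian_rank(3, 4)` gives the parameter count and the rank for the 3-class model with four binary variables; a rank below the parameter count at a random point signals a parameterization map that is generically infinite-to-one.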

An important observation that underlies our investigations is that many 
finite space models (e.g., latent-class models, hidden Markov models) in- 
volve parameterization maps which are polynomial in the scalar parame- 
ters. Thus statistical models have recently been studied by algebraic geome- 
ters [19, 37]. Even in the more general case of distributions belonging to 
an exponential family, which lead to analytic but nonpolynomial maps, it 
is possible to use perspectives from algebraic geometry (see, for instance, 
[2, 3, 13]). Algebraic geometers use terminology rather different from the 
statistical language; for instance, they describe the image of the
parameterization map of a simple latent-class model as a higher secant variety of a
Segre variety. When the dimension of this variety is less than expected, as 
in the example of Goodman above, the variety is termed defective, and one 
may conclude the parameterization map is generically infinite-to-one. Re- 
cent works such as [1, 7, 8] have made much progress in determining when 
defects occur. 

However, as pointed out by Elmore, Hall and Neeman [14], focusing on 
dimension is not sufficient for a complete understanding of the identifiability 
question. Indeed, even if the dimensions of the parameter space and the 
image match, the parameterization might be a generically $k$-to-one map, and
the finite number k cannot be characterized by using dimension approaches. 
For example, consider latent-class models, assuming the number r of classes 
is known. In this context, even though the dimensions agree, we might have 
a generically $k$-to-one map with $k > r!$. (Recall that $r!$ corresponds to the
number of points which are equivalent by permuting class labels.)

This possibility was already raised in the context of psychological studies 
by Kruskal [29], whose work in [30] provides a strong result ensuring generic
r!-to-oneness of the parameterization map for latent r-class models under 
certain conditions. Kruskal's work, however, is focused on models with only 
3 observed variables, or, in other terms, on secant varieties of Segre products 
with 3 factors, or on 3-way arrays. While the connection of Kruskal's work to 
the algebraic geometry literature seems to have been overlooked, the nature 
of his result is highly algebraic. 

Although [14] is ultimately directed at understanding nonparametric
mixtures, Elmore, Hall, and Neeman address the question of identifiability of
the parameterization for latent-class models with many binary observed
variables (i.e., for secant varieties of Segre products of projective lines with many
factors, or on $2 \times 2 \times \cdots \times 2$ tables). These are just the mixtures of Bernoulli
products referred to above, though the authors never introduce that ter- 
minology. Using algebraic methods, they show that with sufficiently many 
observed variables, the image of the parameterization map is birationally 
equivalent to a symmetrization of the parameter space under the symmetric 
group $\Sigma_r$. Thus for sufficiently many observed variables, the
parameterization map is generically $r!$-to-one. (Although the generic nature of the result
is not made explicit, that is, however, all that one can deduce from a bira- 
tional equivalence.) Their proof is constructive enough to give a numerical 
understanding of how many observed variables are sufficient, though this 
number's growth in r is much larger than is necessary (see Corollary 5 and 
Theorem 8 for more details). 

The nonparametric framework. Nonparametric mixture models have re- 
ceived much attention recently [4, 25, 26]. They provide an interesting frame- 
work for modelling very general heterogeneous data. However, identifiability 
is a difficult and crucial issue in such a high-dimensional setting. 

Using algebraic methods to study statistical models is most straightfor- 
ward when state spaces are finite. One way of handling continuous random 
variables via an algebraic approach is to discretize the problem by binning 
the random variable into a finite number of sets. For instance, [11, 15, 26] 
developed cut points methods to transform multivariate continuous obser- 
vations into binomial or multinomial random variables. 

As already mentioned, Elmore, Hall and Neeman [14] consider a finite 
mixture of products of continuous distributions. By binning each contin- 
uous random variable X to create a binary one, defined by the indicator 
function $\mathbf{1}\{X \le t\}$, for some choice of $t$, they pass to a related finite model.
But identification of a distribution is equivalent to identification of its
cumulative distribution function (c.d.f.) $F(t) = \mathbb{P}(X \le t)$. Having addressed
the question of identifiability of the parameters of a mixture of products of 
binary variables, they can thus argue for the identifiability of the parameters 
of the original continuous model, as they continue to do in [24]. However, 
because the authors are not explicit about the generic aspect of their re- 
sults in [14], there are significant gaps in the formal justification of their 
claims. Moreover, the bounds they claim on the number of observed vari- 
ables which ensure generic identifiability leave much room for improvement, 
as they point out. 

The general approach. Our theme in this work is the applicability of 
the fundamental result of Kruskal on 3-way arrays to a spectrum of models
with latent structure. Though our approach is highly algebraic, it has
little in common with that of [14, 24], for establishing that with sufficiently
many observed variables, the parameterization map of r-latent-class models 
is either generically r!-to-one in the parametric case, or that it is exactly 
r!-to-one (under some conditions) in the nonparametric case. Our results 
apply not only to binary variables, but as easily to ones with more states, 
or even to continuous ones. In the case of binary variables (multivariate 
Bernoulli mixtures), we obtain a much lower upper bound for a sufficient 
number of variables to ensure generic identifiability (up to label swapping) 
than the one that can be deduced from [14], and, in fact, our bound gives the 
correct order of growth, $\log_2 r$. (The constant factor we obtain is, however,
still unlikely to be optimal.) 

While our first results are on the identifiability of finite mixtures (with 
a fixed number of components) of finite measure products, our method has 
further consequences for more sophisticated models with a latent structure. 
Our approach for such models with finite state spaces can be summarized 
very simply: we group the observed variables into 3 collections, and view 
the composite states of each collection as the states of a single clumped 
variable. We choose our collections so that they will be conditionally in- 
dependent, given the states of some of the hidden variables. Viewing these 
hidden variables as a single composite one, the model reduces to a special 
instance of the model Kruskal studied. Thus Kruskal's result on 3-way
tables can be applied, after a little work, to show that Kruskal's condition is
satisfied. This might be done either by showing that the clumping process 
results in a sufficiently generic model (ensuring Kruskal's condition is au- 
tomatically satisfied for generic parameters), or that explicit restrictions on 
the parameters ensure this clumping process satisfies Kruskal's condition. In 
more geometric terms, we embed a complicated finite model into a simple 
latent-class model with 3 observed variables, taking care to verify that the 
embedding does not end up in the small set for which Kruskal's result tells 
us nothing. 
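
The clumping step can be sketched in a few lines for a plain mixture of products. This is an added illustration under stated assumptions: the helper names `joint_table` and `clump` are hypothetical, and the grouping of five binary variables into collections of sizes 2, 2 and 1 is chosen only for the example. Flattening the joint table along each collection yields a 3-way table, and the conditional distribution of each clumped variable is the row-wise Kronecker product of the conditional distributions of its members.

```python
import numpy as np

def joint_table(pi, mats):
    """Joint distribution of a latent-class model.
    mats[j] is an r x kappa_j row-stochastic matrix; row i is the law of X_j given Z = i."""
    P = 0.0
    for i, weight in enumerate(pi):
        table = np.array(weight)
        for M in mats:
            table = np.multiply.outer(table, M[i])   # add one observed variable
        P = P + table
    return P

def clump(mats):
    """Conditional laws of the clumped variable (X_a, X_b, ...): row-wise Kronecker product."""
    out = mats[0]
    for M in mats[1:]:
        out = np.stack([np.kron(a, b) for a, b in zip(out, M)])
    return out

# Example: 3 classes, 5 binary variables grouped as {1,2}, {3,4}, {5}.
rng = np.random.default_rng(1)
pi = rng.dirichlet(np.ones(3))
mats = [rng.dirichlet(np.ones(2), size=3) for _ in range(5)]
P = joint_table(pi, mats)                                  # shape (2, 2, 2, 2, 2)
three_way = P.reshape(4, 4, 2)                             # composite states
grouped = [clump(mats[0:2]), clump(mats[2:4]), mats[4]]    # clumped conditional laws
```

Here `three_way` coincides with `joint_table(pi, grouped)`, which is the sense in which the larger model sits inside a latent-class model with 3 observed variables. Whether the clumped matrices satisfy Kruskal's condition still has to be argued separately.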

To take up the continuous random variables case, we simply bin the
real-valued random variables into a partition of $\mathbb{R}$ into $\kappa$ intervals and apply the
previous method to the new discretized random variables. As a consequence, 
we are able to prove that finite mixtures of nonparametric independent vari- 
ates, with at least 3 variates, have identifiable parameters under a mild and 
explicit regularity condition. This is in sharp contrast not only with [14, 24], 
but also with works such as [26], where the components of the mixture are 
assumed to be independent but also identically distributed and [25], which 
dealt only with $r = 2$ groups (see Section 7 for more details).
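
The binning step can be illustrated as follows. This is a minimal sketch, not the construction used in the proofs: the function names `bin_states` and `bin_probabilities` and the convention of right-closed intervals are assumptions made for the example. A real-valued observation is mapped to the index of the interval containing it, and the probability of each interval is a difference of c.d.f. values at consecutive cut points.

```python
import numpy as np

def bin_states(x, cuts):
    """Map real values to states 0..len(cuts), using the intervals
    (-inf, t_1], (t_1, t_2], ..., (t_m, +inf) for increasing cut points t_1 < ... < t_m."""
    cuts = np.asarray(cuts, dtype=float)
    return np.searchsorted(cuts, np.asarray(x, dtype=float), side="left")

def bin_probabilities(cdf, cuts):
    """Probability of each interval, computed as F(t_l) - F(t_{l-1})
    with the conventions F(-inf) = 0 and F(+inf) = 1."""
    values = np.concatenate(([0.0], [cdf(t) for t in cuts], [1.0]))
    return np.diff(values)
```

For example, `bin_states(samples, [0.0, 1.0])` turns real-valued samples into a variable with three states, to which the finite-state results can then be applied.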

We note that Kruskal's result has already been successfully used in phy- 
logeny, to prove identifiability of certain models of evolution of biological 
sequences along a tree [3]. However, application of Kruskal's result is limited
to hidden class models, or to other models with some conditional
independence structure, which have at least 3 observed variates. Kruskal's theorem
can sometimes be used for models with many hidden variables, by
considering a clumped latent variable $Z = (Z_1, \dots, Z_n)$. We give two examples of
such a use for models presenting a dependency structure on the observa- 
tions, namely hidden Markov models (Section 6.1) and mixture models for 
random graphs (Section 6.2). For hidden Markov models, we recover many 
known results, and improve on some of them. For the random graph mixture 
model, we establish identifiability for the first time. Note that in all these 
applications we always assume the number of latent classes is known, which 
is crucial in using Kruskal's approach. Identification of the number of classes 
is an important issue that we do not consider here. 

Algebraic terminology. Polynomials play an important role throughout 
our arguments, so we introduce some basic terminology and facts from al- 
gebraic geometry that we need. For a more thorough but accessible intro- 
duction to the field, we recommend [10]. 

An algebraic variety $V$ is defined as the simultaneous zero-set of a finite
collection of multivariate polynomials $\{f_i\}_{i=1}^{n} \subset \mathbb{C}[x_1, x_2, \dots, x_k]$,

$$V = V(f_1, \dots, f_n) = \{a \in \mathbb{C}^k \mid f_i(a) = 0,\ 1 \le i \le n\}.$$

A variety is all of $\mathbb{C}^k$ only when all $f_i$ are 0; otherwise, a variety is called
a proper subvariety and must be of dimension less than $k$, and, hence, of
Lebesgue measure 0 in $\mathbb{C}^k$. Analogous statements hold if we replace $\mathbb{C}^k$ by
$\mathbb{R}^k$, or even by any subset $\Theta \subseteq \mathbb{R}^k$ containing an open $k$-dimensional ball.
This last possibility is of course most relevant for the statistical models of
interest to us, since the parameter space is naturally identified with a
full-dimensional subset of $[0,1]^L$ for some $L$ (see Section 3 for more details).
Intersections of algebraic varieties are algebraic varieties as they are the
simultaneous zero-set of the unions of the original sets of polynomials. Finite
unions of varieties are also varieties, since if sets $S_1$ and $S_2$ define varieties,
then $\{fg \mid f \in S_1, g \in S_2\}$ defines their union.

Given a set $\Theta \subseteq \mathbb{R}^k$ of full dimension, we will often need to say some
property holds for all points in $\Theta$, except possibly for those on some proper
subvariety $\Theta \cap V(f_1, \dots, f_n)$. We express this succinctly by saying the
property holds generically on $\Theta$. We emphasize that the set of exceptional points
of $\Theta$, where the property need not hold, is thus necessarily of Lebesgue
measure zero.
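
As an informal illustration of the term (an added example, not taken from the original argument), view $[0,1]^{k \times k}$ as a set of $k \times k$ matrices. The singular matrices are the zero-set of a single polynomial, the determinant, so invertibility holds generically. The sketch below samples random matrices and reports the fraction that are numerically singular; `fraction_singular`, the sample size and the tolerance are illustrative choices.

```python
import numpy as np

def fraction_singular(k, n_samples=1000, seed=0, tol=1e-12):
    """Fraction of random k x k matrices with entries uniform on [0, 1]
    whose determinant is numerically zero (absolute value below tol)."""
    rng = np.random.default_rng(seed)
    mats = rng.uniform(0.0, 1.0, size=(n_samples, k, k))
    dets = np.linalg.det(mats)          # determinant of each matrix in the stack
    return float(np.mean(np.abs(dets) < tol))
```

A sampled point lands on the exceptional subvariety with probability zero, which is why the returned fraction is expected to be zero in practice, even though singular matrices certainly exist.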

In studying parametric models, $\Theta$ is typically taken to be the parameter
space for the model, so that a claim of generic identifiability of model pa- 
rameters means that all nonidentifiable parameter choices lie within a proper 
subvariety, and thus form a set of Lebesgue measure zero. While we do not
always explicitly characterize the subvariety in statements of theorems, one
could do so by careful consideration of our proofs. 

In a nonparametric context, where algebraic terminology appropriate to 
the finite-dimensional setting is inappropriate, we avoid the use of the term 
"generic." Instead, we always give explicit characterizations of those param- 
eter choices which may not be identifiable. 

Roadmap. We first present finite mixtures of finite measure products 
with a conditional independence structure (or latent-class models) in Sec- 
tion 3. Then, Kruskal's result and consequences are presented in Section 4. 
Direct consequences on the identifiability of the parameters of finite mixtures 
of finite measure products appear in Section 5. More complicated dependent 
variables models, including hidden Markov models and a random graph mix- 
ture model, are studied in Section 6. In Section 7, we consider mixtures of 
nonparametric distributions, analogous to the finite ones considered earlier. 
All proofs are postponed to Section 8. 

3. Finite mixtures of finite measure products. Consider a vector of
observed random variables $\{X_j\}_{1 \le j \le p}$ where $X_j$ has finite state space with
cardinality $\kappa_j$. Note that these variables are not assumed to be i.i.d. nor
to have the same state space. To model the distribution of these variables,
we use a latent (unobserved) random variable $Z$ with values in $\{1, \dots, r\}$,
where $r$ is assumed to be known. We interpret $Z$ as denoting an
unobservable class, and assume that conditional on $Z$, the $X_j$'s are independent
random variables. The probability distribution of $Z$ is given by the vector
$\pi = (\pi_i) \in (0,1)^r$ with $\sum_i \pi_i = 1$. Moreover, the probability distribution of
$X_j$ conditional on $Z = i$ is specified by a vector $p_{ij} \in [0,1]^{\kappa_j}$. We use the
notation $p_{ij}(l)$ for the $l$th coordinate of this vector ($1 \le l \le \kappa_j$). Thus we
have $\sum_l p_{ij}(l) = 1$.

For each class $i$, the joint distribution of the variables $X_1, \dots, X_p$
conditional on $Z = i$ is then given by a $p$-dimensional $\kappa_1 \times \cdots \times \kappa_p$ table

$$P_i = \bigotimes_{j=1}^{p} p_{ij},$$

whose $(l_1, l_2, \dots, l_p)$-entry is $\prod_{j=1}^{p} p_{ij}(l_j)$. Let

$$P = \sum_{i=1}^{r} \pi_i P_i. \qquad (1)$$

Then $P$ is the distribution of a finite mixture of finite measure products, with
a known number $r$ of components. The $\pi_i$ are interpreted as probabilities
that a draw from the population is in the $i$th of $r$ classes. Conditioned on the
class, the $p$ observable variables are independent. However, since the class
is not discernible, the $p$ feature variables $X_j$ described by one-dimensional
marginalizations of $P$ are generally not independent.

We refer to the model described above as the $r$-class, $p$-feature model with
state space $\{1, \dots, \kappa_1\} \times \cdots \times \{1, \dots, \kappa_p\}$, and denote it by $\mathcal{M}(r; \kappa_1, \kappa_2, \dots, \kappa_p)$.
Identifying the parameter space of this model with a subset $\Theta$ of $[0,1]^L$ where
$L = (r - 1) + r \sum_{i=1}^{p} (\kappa_i - 1)$ and letting $K = \prod_{i=1}^{p} \kappa_i$, the parameterization
map for this model is

$$\Psi_{r,p,(\kappa_i)} : \Theta \to [0,1]^K.$$

In the following, we specify parameters by vectors such as $\pi$ and $p_{ij}$, always
implicitly assuming their entries add to 1.

As previously noted, this model's parameters are not strictly identifiable
if $r > 1$, since the sum in (1) can always be reordered without changing $P$.
Even modulo this label swapping, there are certainly special instances when
identifiability will not hold. For instance, if $P_i = P_j$, then the parameters $\pi_i$
and $\pi_j$ can be varied, as long as their sum $\pi_i + \pi_j$ is held fixed, without
effect on the distribution $P$. Slightly more elaborate "special" instances of
nonidentifiability can be constructed, but in full generality, this issue remains
poorly understood. Ideally, one would know for which choices of $r, p, (\kappa_i)$,
generic values of the model's parameters are identifiable up to permutation of
the terms in (1), and, additionally, have a characterization of the exceptional
set of parameters on which identifiability fails.
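
Both phenomena are easy to reproduce numerically. The sketch below is an added illustration: the function name `mixture_table` and all parameter values are hypothetical. It builds the mixture distribution from the class proportions and the conditional distributions, then compares the tables obtained after relabeling the classes and after shifting weight between two classes with identical conditional distributions.

```python
from functools import reduce
import numpy as np

def mixture_table(pi, cond):
    """Mixture of product distributions: sum_i pi_i * (p_i1 x ... x p_ip) as a p-way table.
    cond[j] is an r x kappa_j matrix whose i-th row is the law of X_j given class i."""
    return sum(w * reduce(np.multiply.outer, [np.asarray(M)[i] for M in cond])
               for i, w in enumerate(pi))

# Hypothetical parameters: classes 0 and 1 have identical conditional laws.
pi = [0.2, 0.3, 0.5]
M1 = np.array([[0.9, 0.1], [0.9, 0.1], [0.3, 0.7]])
M2 = np.array([[0.6, 0.4], [0.6, 0.4], [0.1, 0.9]])
P = mixture_table(pi, [M1, M2])

# Label swapping: permute the classes consistently everywhere.
perm = [2, 0, 1]
P_swapped = mixture_table([pi[k] for k in perm], [M1[perm], M2[perm]])

# Identical components: only the sum of their weights (0.5 here) matters.
P_shifted = mixture_table([0.4, 0.1, 0.5], [M1, M2])
```

All three tables agree entry by entry, so neither the class labels nor the split of weight between the two identical classes can be recovered from the distribution of the observations.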

4. Kruskal's theorem and its consequences. The basic identifiability
result on which we build our later arguments is a result of Kruskal [29, 30]
in the context of factor analyses for $p = 3$ features. Kruskal's result deals
with a 3-way contingency table (or array) which cross-classifies a sample
of $n$ individuals with respect to 3 polytomous variables (the $i$th of which
takes values in $\{1, \dots, \kappa_i\}$). If there is some latent variable $Z$ with values in
$\{1, \dots, r\}$ so that each of the $n$ individuals belongs to one of the $r$ latent
classes and within the $l$th latent class, the 3 observed variables are
mutually independent, then this $r$-class latent structure would serve as a simple
explanation of the observed relationships among the variables in the 3-way
contingency table. This latent structure analysis corresponds exactly to the
model $\mathcal{M}(r; \kappa_1, \kappa_2, \kappa_3)$ described in the previous section.

To emphasize the focus on 3-variate models, note that in [30] Kruskal
points out that 2-way tables arising from the model $\mathcal{M}(r; \kappa_1, \kappa_2)$ do not have
a unique decomposition when $r \ge 2$. This nonidentifiability is intimately
related to the nonuniqueness of certain matrix factorizations. While Goodman
[22] studied the model $\mathcal{M}(r; \kappa_1, \kappa_2, \kappa_3, \kappa_4)$ for fitting to 4-way contingency
tables, no formal result about uniqueness of the decomposition was
established. In fact, nonidentifiability of the model under certain circumstances
is highlighted in that work.

To present Kruskal's result, we introduce some algebraic notation. For
$j = 1, 2, 3$, let $M_j$ be a matrix of size $r \times \kappa_j$, with $m_i^j = (m_i^j(1), \dots, m_i^j(\kappa_j))$
the $i$th row of $M_j$. Let $[M_1, M_2, M_3]$ denote the $\kappa_1 \times \kappa_2 \times \kappa_3$ tensor defined
by

$$[M_1, M_2, M_3] = \sum_{i=1}^{r} m_i^1 \otimes m_i^2 \otimes m_i^3.$$

In other words, $[M_1, M_2, M_3]$ is a three-dimensional array whose $(u,v,w)$
element is

$$[M_1, M_2, M_3]_{u,v,w} = \sum_{i=1}^{r} m_i^1(u)\, m_i^2(v)\, m_i^3(w)$$

for any $1 \le u \le \kappa_1$, $1 \le v \le \kappa_2$, $1 \le w \le \kappa_3$. Note that $[M_1, M_2, M_3]$ is left
unchanged by simultaneously permuting the rows of all the $M_j$ and/or rescaling
the rows so that the product of the scaling factors used for the $m_i^j$, $j = 1, 2, 3$,
is equal to 1.

A key point is that the probability distribution in a finite latent-class
model with three observed variables is exactly described by such a tensor:
let $M_j$, $j = 1, 2, 3$, be the matrix whose $i$th row is $p_{ij} = \mathbb{P}(X_j = \cdot \mid Z = i)$. Let
$\tilde{M}_1 = \operatorname{diag}(\pi) M_1$ be the matrix whose $i$th row is $\pi_i p_{i1}$. Then the $(u,v,w)$
element of the tensor $[\tilde{M}_1, M_2, M_3]$ equals $\mathbb{P}(X_1 = u, X_2 = v, X_3 = w)$. Thus
knowledge of the distribution of $(X_1, X_2, X_3)$ is equivalent to the knowledge
of the tensor $[\tilde{M}_1, M_2, M_3]$. Note that the $M_j$'s are stochastic matrices, and
thus the vector of $\pi_i$'s can be thought of as scaling factors.
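
The triple product and its invariances can be written down directly. The following sketch is an added illustration; the function name `triple_product` and the randomly drawn parameters are assumptions of the example. It forms the tensor with a single `einsum` call, builds the joint distribution of a 3-variable latent-class model by scaling the rows of the first matrix by the class proportions, and applies a row permutation and a row rescaling whose scale factors multiply to 1.

```python
import numpy as np

def triple_product(M1, M2, M3):
    """Tensor with (u, v, w) entry sum_i M1[i, u] * M2[i, v] * M3[i, w]."""
    return np.einsum("iu,iv,iw->uvw", M1, M2, M3)

# Hypothetical 3-class model with 3, 4 and 2 states for the observed variables.
rng = np.random.default_rng(0)
pi = rng.dirichlet(np.ones(3))
M1, M2, M3 = (rng.dirichlet(np.ones(k), size=3) for k in (3, 4, 2))

M1_scaled = np.diag(pi) @ M1                 # i-th row is pi_i times the i-th row of M1
T = triple_product(M1_scaled, M2, M3)        # joint distribution of (X1, X2, X3)

# Simultaneous row permutation.
perm = [2, 0, 1]
T_perm = triple_product(M1_scaled[perm], M2[perm], M3[perm])

# Row rescaling with scale factors whose product is 1 in every row.
s1 = np.array([2.0, 0.5, 4.0])
s2 = np.array([0.25, 4.0, 0.5])
s3 = 1.0 / (s1 * s2)
T_scaled = triple_product(s1[:, None] * M1_scaled, s2[:, None] * M2, s3[:, None] * M3)
```

The three tensors `T`, `T_perm` and `T_scaled` coincide, which is why uniqueness can only be expected up to permutation and rescaling of rows. For stochastic matrices the rescaling freedom disappears once the rows are required to sum to 1.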

For a matrix $M$, the Kruskal rank of $M$ will mean the largest number $I$
such that every set of $I$ rows of $M$ are independent. Note that this concept
would change if we replaced "row" by "column," but we will only use the
row version in this paper. With the Kruskal rank of $M$ denoted by $\operatorname{rank}_K M$,
we have

$$\operatorname{rank}_K M \le \operatorname{rank} M$$

and equality of rank and Kruskal rank does not hold in general. However,
in the particular case where a matrix $M$ of size $p \times q$ has rank $p$, it also has
Kruskal rank $p$.
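
For small matrices the Kruskal rank can be computed by brute force. The sketch below is an added illustration: the function name `kruskal_rank` and the numerical tolerance are assumptions, and the enumeration of all row subsets is exponential in the number of rows, so it is meant only for toy examples.

```python
from itertools import combinations
import numpy as np

def kruskal_rank(M, tol=1e-10):
    """Largest I such that every set of I rows of M is linearly independent (row version)."""
    M = np.asarray(M, dtype=float)
    n_rows = M.shape[0]
    best = 0
    for size in range(1, n_rows + 1):
        # every subset of `size` rows must have full row rank
        if all(np.linalg.matrix_rank(M[list(rows)], tol=tol) == size
               for rows in combinations(range(n_rows), size)):
            best = size
        else:
            break   # if some subset of this size is dependent, larger sizes fail too
    return best
```

For instance, a matrix with two identical nonzero rows has Kruskal rank 1 regardless of its ordinary rank, which illustrates that the inequality between the two ranks can be strict.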

The fundamental algebraic result of Kruskal is the following. 

Theorem 1 (Kruskal [29, 30]). Let $I_j = \operatorname{rank}_K M_j$. If

$$I_1 + I_2 + I_3 \ge 2r + 2,$$

then $[M_1, M_2, M_3]$ uniquely determines the $M_j$, up to simultaneous
permutation and rescaling of the rows.

The equivalence between the distributions of 3-variate latent-class models
and 3-tensors, combined with the fact that rows of stochastic matrices sum
to 1, gives the following reformulation.

Corollary 2. Consider the model $\mathcal{M}(r; \kappa_1, \kappa_2, \kappa_3)$, with the
parameterization of Section 3. Suppose all entries of $\pi$ are positive. For each $j = 1, 2, 3$,
let $M_j$ denote the matrix whose rows are $p_{ij}$, $i = 1, \dots, r$, and let $I_j$ denote
its Kruskal rank. Then if

$$I_1 + I_2 + I_3 \ge 2r + 2,$$

the parameters of the model are uniquely identifiable, up to label swapping.

By observing that Kruskal's condition on the sum of Kruskal ranks can 
be expressed through polynomial inequalities in the parameters, and thus 
holds generically, we obtain the following corollary.

Corollary 3. The parameters of the model $\mathcal{M}(r; \kappa_1, \kappa_2, \kappa_3)$ are
generically identifiable, up to label swapping, provided

$$\min(r, \kappa_1) + \min(r, \kappa_2) + \min(r, \kappa_3) \ge 2r + 2.$$

The assertion remains valid if, in addition, the class proportions $\{\pi_i\}_{1 \le i \le r}$
are held fixed and positive in the model.

For the last statement of this corollary we note that if the mixing
proportions are positive then one can translate Kruskal's condition into a
polynomial requirement so that only the parameters $p_{ij}(l) = \mathbb{P}(X_j = l \mid Z = i)$
appear. Thus the generic aspect only concerns this part of the parameter
space, and not the part with the proportions $\pi_i$. As a consequence, the
statement is valid when the proportions are held fixed in $(0,1)$. This is of
great importance, as often statisticians assume that these proportions are
fixed and known (for instance using $\pi_i = 1/r$ for every $1 \le i \le r$). Without
observing this fact, we would not have a useful identifiability result in the
case of known $\pi_i$, since fixing values of the $\pi_i$ results in considering a
subvariety of the full parameter space, which a priori might be included in the
subvariety of nonidentifiable parameters allowed by Corollary 3.

5. Parameter identifiability of finite mixtures of finite measure products. 

Finite mixtures of products of finite measure are widely used to model data, 
for instance in biological taxonomy, medical diagnosis or classification of 
text documents [21, 35]. The identifiability issue for these models was first 
addressed forty years ago by Teicher [42]. Teicher's result states the equiv- 
alence between identifiability of mixtures of product measure distributions
and identifiability of the corresponding one-dimensional mixture models. As
a consequence, finite mixtures of Bernoulli products are not identifiable in a 
strict sense [23]. Teicher's result is valid for finite mixtures with an unknown 
number of components, but it can easily be seen that nonidentifiability oc- 
curs even with a known number of components [6], Section 1. The very 
simplicity of the equivalence condition stated by Teicher [42] likely impeded 
statisticians from looking further at this issue. 

Here we prove that finite mixtures of Bernoulli products (with a known 
number of components) are in fact generically identifiable, indicating why 
these models are well behaved in practice with respect to statistical param- 
eter inference, despite their lack of strict identifiability [6]. 

To obtain our results, we must first pass from Kruskal's theorem on 3- 
variate models to a similar one for p-variate models. To do this, we observe 
that p observed variables can be combined into 3 agglomerate variables, so 
that Kruskal's result can be applied. 

Theorem 4. Consider the model $\mathcal{M}(r; k_1, \dots, k_p)$ where $p \ge 3$.
Suppose there exists a tripartition of the set $S = \{1, \dots, p\}$ into three disjoint
nonempty subsets $S_1, S_2, S_3$, such that if $\kappa_i = \prod_{j \in S_i} k_j$ then

$$\min(r, \kappa_1) + \min(r, \kappa_2) + \min(r, \kappa_3) \ge 2r + 2. \tag{2}$$

Then model parameters are generically identifiable, up to label swapping.
Moreover, the statement remains valid when the mixing proportions $\{\pi_i\}_{1 \le i \le r}$
are held fixed and positive.

Considering the special case of finite mixtures of $r$ Bernoulli products
with $p$ components [i.e., the $r$-class, $p$-binary feature model $\mathcal{M}(r; 2, 2, \dots, 2)$],
to obtain the strongest identifiability result, we choose a tripartition that
maximizes the left-hand side of inequality (2). Doing so yields the following.

Corollary 5. Parameters of the finite mixture of $r$ different Bernoulli
products with $p$ components are generically identifiable, up to label swapping,
provided

$$p \ge 2\lceil \log_2 r \rceil + 1,$$

where $\lceil x \rceil$ is the smallest integer at least as large as $x$.
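
The condition of Theorem 4 can be explored by exhaustive search over tripartitions. The sketch below is only an illustration for small $p$ (it enumerates all $3^p$ assignments); the function names are hypothetical. It maximizes the left-hand side of inequality (2) and compares it with $2r + 2$.

```python
from itertools import product
from math import prod


def best_kruskal_sum(r, ks):
    """Maximum over tripartitions (S1, S2, S3) of {1..p} of
    sum_i min(r, kappa_i), where kappa_i is the product of k_j over S_i."""
    best = -1  # stays -1 when no tripartition exists (p < 3)
    for labels in product(range(3), repeat=len(ks)):
        if set(labels) != {0, 1, 2}:
            continue  # all three subsets must be nonempty
        kappas = [prod(k for k, g in zip(ks, labels) if g == i)
                  for i in range(3)]
        best = max(best, sum(min(r, kappa) for kappa in kappas))
    return best


def theorem4_condition(r, ks):
    """True when some tripartition satisfies inequality (2)."""
    return best_kruskal_sum(r, ks) >= 2 * r + 2


def ceil_log2(r):
    """Smallest integer at least log2(r), computed without floats (r >= 1)."""
    return (r - 1).bit_length()
```

For instance, with $r = 4$ classes the bound of Corollary 5 gives $p \ge 5$: `theorem4_condition(4, [2] * 5)` holds, while with $p = 4$ binary variables no tripartition satisfies inequality (2).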

Note that generic identifiability of this model for sufficiently large values 
of p is a consequence of the results of Elmore, Hall and Neeman, in [14], 
although neither the generic nature of the result, nor the fact that the model 
is simply a mixture of Bernoulli products, is noted by the authors. Moreover,
our lower bound on p to ensure generic identifiability is superior to the one
obtained in [14]. Indeed, letting $C(r)$ be the minimal integer such that if
$p \ge C(r)$ then the $r$-class, $p$-binary feature model is generically identifiable,
then [14] established that

$$\log_2 r \le C(r) \le c_2\, r \log_2 r$$

for some effectively computable constant $c_2$. While the lower bound for $C(r)$
is easy to obtain from the necessity that the dimension of the parameter
space, $rp + (r - 1)$, be no larger than that of the distribution space, $2^p - 1$,
the upper bound required substantial work. Corollary 5 above establishes
the stronger result that

$$C(r) \le 2\lceil \log_2 r \rceil + 1.$$

Note that this new upper bound, along with the simple lower bound, shows
that the order of growth of $C(r)$ is precisely $\log_2 r$.

For the more general $\mathcal{M}(r; \kappa, \dots, \kappa)$ model, our lower bound on the
number of variates needed to generically identify the parameters, up to label
swapping, is

$$p \ge 2\lceil \log_\kappa r \rceil + 1.$$

The proof of this bound follows the same lines as that of Corollary 5, and
is therefore omitted.

6. Hidden classes models with dependent observations. In this section, 
we give several additional illustrations of the applicability of Kruskal's result 
in the context of dependent observations. The hidden Markov models and 
random graph mixture models we consider may at first appear to be far from 
the focus of Kruskal's theorem. This is not the case, however, as in both 
the observable variables are independent when appropriately conditioned 
on hidden ones. We succeed in embedding these models into an appropriate
$\mathcal{M}(r; \kappa_1, \kappa_2, \kappa_3)$ and then use extensive algebraic arguments to obtain the
(generic) identifiability results we desire.

6.1. Hidden Markov models. Almost 40 years ago, Petrie ([39], Theorem 
1.3) proved generic identifiability, up to label swapping, for discrete hidden 
Markov models (HMMs). We offer a new proof, based on Kruskal's theorem, 
of this well-known result. This provides an interesting alternative to Petrie's 
more direct approach, and one that might extend to more complex frame- 
works, such as Markov chains with Markov regime, where no identifiability 
results are known (see, for instance, [9]). Moreover, as a by-product, our 
approach establishes a new bound on the number of consecutive variables 
needed, such that the marginal distribution for a generic HMM uniquely 
determines the full probability distribution. 

We first briefly describe HMMs. Consider a stationary Markov chain
$\{Z_n\}_{n \ge 0}$ on state space $\{1, \dots, r\}$ with transition matrix $A$ and initial
distribution $\pi$ (assumed to be the stationary distribution). Conditional on
$\{Z_n\}_{n \ge 0}$, the observations $\{X_n\}_{n \ge 0}$ on state space $\{1, \dots, \kappa\}$ are assumed
to be independent, and the distribution of each $X_n$ only depends on $Z_n$.
Denote by $B$ the matrix of size $r \times \kappa$ containing the conditional probabilities
$\mathbb{P}(X_n = k \mid Z_n = i)$. The process $\{X_n\}_{n \ge 0}$ is then a hidden Markov chain.
Note that this is not a Markov process. The matrices $A$ and $B$ constitute
the parameters for the $r$-hidden state, $\kappa$-observable state HMM, and the
parameter space can be identified with a full-dimensional subset of $\mathbb{R}^{r(r+\kappa-2)}$.
We refer to [5, 16] for more details on HMMs.
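
As an illustration of this setup, the following sketch computes the joint distribution of $m$ consecutive observations of a stationary HMM by enumerating observation sequences with the standard forward recursion. The function names and the numerical parameters are illustrative assumptions, and the enumeration is only feasible for small $\kappa$ and $m$.

```python
from itertools import product

import numpy as np


def stationary(A):
    """Stationary distribution of the transition matrix A (left eigenvector
    for the eigenvalue 1, normalized to sum to 1)."""
    vals, vecs = np.linalg.eig(A.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return v / v.sum()


def hmm_marginal(A, B, m):
    """Joint distribution of (X_0, ..., X_{m-1}) as an array of shape
    (kappa,) * m, using the forward recursion on each observation sequence."""
    pi = stationary(A)
    kappa = B.shape[1]
    out = np.zeros((kappa,) * m)
    for xs in product(range(kappa), repeat=m):
        alpha = pi * B[:, xs[0]]
        for x in xs[1:]:
            alpha = (alpha @ A) * B[:, x]
        out[xs] = alpha.sum()
    return out


# Illustrative parameters: r = 2 hidden states, kappa = 2 observable states.
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.7, 0.3], [0.1, 0.9]])
P3 = hmm_marginal(A, B, 3)
```

Here $A$ contributes $r(r-1)$ free parameters and $B$ contributes $r(\kappa-1)$, which gives the dimension $r(r+\kappa-2)$ of the parameter space.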

Petrie [39] describes quite explicitly, for fixed $r$ and $\kappa$, a subvariety of
the parameter space for an HMM on which identifiability might fail. Indeed,
Petrie proved that the set of parameters on which identifiability holds is the
intersection of the following: the set of regular HMMs; the set where the
components of the matrix $B$, namely $\mathbb{P}(X_n = k \mid Z_n = i)$, are nonzero; the
set where some row of $B$ has distinct entries [namely there exists some
$i \in \{1, \dots, r\}$ such that all the $\{\mathbb{P}(X_n = k \mid Z_n = i)\}_{1 \le k \le \kappa}$ are distinct]; the set where
the matrix $A$ is nonsingular, and 1 is an eigenvalue with multiplicity one for
$A$ [namely, $P'(1, A) \ne 0$ where $P(\lambda, A) = \det(\lambda I - A)$]. Regular HMMs were
first described by Gilbert in [20]. The definition relies on a notion of rank,
and an HMM is regular if its rank is equal to its number of hidden states $r$.
More details may be found in [17, 20].

The result of Petrie assumes knowledge of the whole probability distribution
of the HMM. But it is known ([17], Lemma 1.2.4) that the distribution
of an HMM with $r$ hidden states and $\kappa$ observed states is completely
determined by the marginal distribution of $2r$ consecutive variables. An even
stronger result appears in [38], Chapter 1, Corollary 3.4: the marginal
distribution of $2r - 1$ consecutive variables suffices to reconstruct the whole
HMM distribution. Combining these results shows that generic identifiability
holds for HMMs from the distribution of $2r - 1$ consecutive observations.
Note there is no dependence of this number on $\kappa$, even though one might
suspect a larger observable state space would aid identifiability.

Using Kruskal's theorem we prove the following. 

Theorem 6. The parameters of an HMM with $r$ hidden states and $\kappa$
observable states are generically identifiable from the marginal distribution
of $2k + 1$ consecutive variables provided $k$ satisfies

$$\binom{k + \kappa - 1}{\kappa - 1} \ge r. \tag{3}$$




While we do not explicitly characterize a set of possibly nonidentifiable 
parameters as Petrie did, in principle we could do so. 

Fig. 1. Embedding the hidden Markov model into a simpler latent-class model.

Note, however, that we require only the marginal of $2k + 1$ consecutive
variables, where $k$ satisfies an explicit condition involving $\kappa$. The worst case
(i.e., the largest value for $k$) arises when $\kappa = 2$, since
$\kappa \mapsto \binom{k + \kappa - 1}{\kappa - 1}$ is an
increasing function for positive $k$. In this worst case, we easily compute
that $2k + 1 = 2r - 1$ consecutive variables suffice to generically identify the
parameters. Thus our approach yields generic versions of the claims of [17]
and [38] described above.

Moreover, when the number $\kappa$ of observed states is increased, the minimal
value of $2k + 1$ which ensures identifiability by Theorem 6 becomes smaller.
Thus the fact that generic HMMs are characterized by the marginal of $2k + 1$
consecutive variables, where $k$ satisfies (3), results in a much better bound
than $2r - 1$ as soon as the observed state space has more than 2 points. In
this sense, our result is stronger than the one of Paz [38].
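
The comparison with the bound $2r - 1$ can be tabulated directly. The sketch below assumes that condition (3) reads $\binom{k+\kappa-1}{\kappa-1} \ge r$, a reading consistent with the worst case $\kappa = 2$ giving $2k + 1 = 2r - 1$; the function names are illustrative.

```python
from math import comb


def min_k(r, kappa):
    """Smallest positive integer k with C(k + kappa - 1, kappa - 1) >= r,
    under the assumed form of condition (3). Requires kappa >= 2."""
    if kappa < 2:
        raise ValueError("need at least 2 observable states")
    k = 1
    while comb(k + kappa - 1, kappa - 1) < r:
        k += 1
    return k


def variables_needed(r, kappa):
    """Number 2k + 1 of consecutive variables given by the bound."""
    return 2 * min_k(r, kappa) + 1
```

With $r = 5$ hidden states, two observable states give $2k + 1 = 9 = 2r - 1$, while three observable states already reduce this to $5$.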

In proving Theorem 6, we embed the hidden Markov model in a simpler
latent-class model, as illustrated in Figure 1. The hidden variable $Z_k$ is the
only one we preserve, while we cluster the observed variables into groups
so they may be treated as 3 observed variables. According to properties of
graphical models (see, e.g., [31]), the agglomerated variables are independent
when conditioned on $Z_k$. So Kruskal's theorem applies. However, additional
algebraic arguments are needed to see that the embedding gives sufficiently
generic points so that we may identify parameters.

To conclude this section, note that Leroux [32] used Teicher's result [42] to 
establish a sufficient condition for exact identifiability of parametric hidden 
Markov models, with possibly continuous observations. 

6.2. A random graph mixture model. We next illustrate the application 
of our method to studying a random graph mixture model. This heterogeneous
model is used in a wide range of applications, such as molecular
biology (gene interactions or metabolic networks [12]), social sciences
(relationships or co-authorship networks [36]) or the study of the world wide
web (hyperlinks graphs [44]).

We consider a random graph mixture model in which each node belongs to 
some unobserved class, or group, and conditional on the classes of all nodes, 
the edges are independent random variables whose distributions depend only 
on the classes of the nodes they connect. More precisely, consider an
undirected graph with $n$ nodes labeled $1, \dots, n$ and where presence/absence of an
edge between two different nodes $i$ and $j$ is given by the indicator variable
$X_{ij}$. Let $\{Z_i\}_{1 \le i \le n}$ be i.i.d. random variables with values in $\{1, \dots, r\}$ and
probability distribution $\pi \in (0, 1)^r$ representing node classes. Conditional
on the classes of nodes $\{Z_i\}$, the edge indicators are independent random
variables whose distribution is Bernoulli with some parameter $p_{Z_i Z_j}$.
The between-groups connection parameters $p_{ij} \in [0,1]$ satisfy $p_{ij} = p_{ji}$, for
all $1 \le i, j \le r$. We emphasize that the observed random variables $X_{ij}$ for
this model are not independent, just as the observed variables were not 
independent in the mixture models considered earlier in this paper. 
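
The generative mechanism just described can be sketched as a short sampler. This is only an illustration of the model definition; the function name `sample_graph` and its arguments are hypothetical, and node classes are indexed from 0 rather than 1.

```python
import numpy as np


def sample_graph(n, pi, P, rng):
    """Draw node classes Z_i i.i.d. from pi, then independent Bernoulli
    edges X_ij with parameter P[Z_i, Z_j] for each pair i < j.
    Returns the classes and the symmetric 0/1 adjacency matrix."""
    pi = np.asarray(pi, dtype=float)
    P = np.asarray(P, dtype=float)
    z = rng.choice(len(pi), size=n, p=pi)
    probs = P[z[:, None], z[None, :]]
    # Keep only pairs i < j, then symmetrize (undirected graph, no loops).
    upper = np.triu(rng.random((n, n)) < probs, k=1)
    X = (upper | upper.T).astype(int)
    return z, X


rng = np.random.default_rng(0)
# Illustrative r = 2 model: a small "hub" class and a sparser class.
z, X = sample_graph(16, [0.2, 0.8], [[0.9, 0.5], [0.5, 0.1]], rng)
```

The edge indicators are dependent marginally because they share the hidden classes $Z_i$, even though they are independent once the classes are fixed.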

The interest in the random graph model lies in the fact that different 
nodes may have different connectivity properties. For instance one class 
may describe hubs which are nodes with a very high connectivity, and a 
second class may contain the other nodes with a lower overall connectivity.
Thus one can model different node behaviours with a reasonable number of 
parameters. Examples of networks easily modelled with this approach, and 
more details on the properties of this model, can be found in [12]. 

This model has been rediscovered many times in the literature and in 
various fields of applications. A nonexhaustive bibliography includes [12, 18, 
36, 41]. However, identifiability of the parameters for this random graph 
model has never been addressed in the literature. 

Frank and Harary [18] study the statistical inference of the parameters in
the restricted $\alpha$-$\beta$ or affiliation model. In this setup, only two parameters
are used to model the intra-group and inter-group probabilities of an edge
occurrence: $p_{ii} = \alpha$, $1 \le i \le r$, and $p_{ij} = \beta$, $1 \le i < j \le r$. Using the total
number of edges, the proportion of transitive triads and the proportion of
3-cycles among triads (see definitions (3) and (4) in [18]), they obtain estimates
of the parameters $\alpha$, $\beta$ and sometimes $r$, in various cases ($\alpha$, $\beta$ unknown
and $\pi_i = 1/r$ with unknown $r$, for instance). However, they do not discuss
the uniqueness of the solutions of the nonlinear equations defining those
estimates (see (16), (29) and (33) in [18]).

We prove the following result. 

Theorem 7. The parameters of the random graph model with $r = 2$
node states are strictly identifiable, up to label swapping, provided there are
at least 16 nodes and the connection parameters $\{p_{11}, p_{12}, p_{22}\}$ are distinct.

Our basic approach is to embed the random graph model into a model to
which Kruskal's theorem applies. Since we have many hidden variables, one 
for each node, we combine them into a single composite variable describing 
the states of all nodes at once. Since the observed edge variables are binary, 
we also combine them into collections, to create 3 composite edge variables 
with more states. We do this in such a way that the composite edge vari- 
ables are still independent conditioned on the composite node state. The 
main technical difficulty is that the matrices whose entries give probabili- 
ties of observing a composite edge variable conditioned on a composite node 
state must have well-understood Kruskal rank. This requires some involved 
algebraic work. 

The random graph model will be studied more thoroughly in a forthcoming
work. The special case of the affiliation model, which is not addressed by
Theorem 7, will be dealt with there as well.

7. Finite mixtures of nonparametric measure products. In this section,
we consider a nonparametric model of finite mixtures of $r$ different
probability distributions $\mu_1, \dots, \mu_r$ on $\mathbb{R}^p$, with $p \ge 3$. For every $1 \le i \le r$, we
denote by $\mu_i^j$ the $j$th marginal of $\mu_i$ and $F_i^j$ the corresponding c.d.f.
(defined by $F_i^j(t) = \mu_i^j((-\infty, t])$ for any $t \in \mathbb{R}$). Without loss of generality, we
may assume that the functions $F_i^j$ are absolutely continuous.

For our first result, we assume further that the mixture model has the
form

$$\mathbb{P} = \sum_{i=1}^{r} \pi_i \mu_i = \sum_{i=1}^{r} \pi_i \prod_{j=1}^{p} \mu_i^j, \tag{4}$$

in which, conditional on a latent structure (specified by the proportions
$\pi_i$), the $p$ variates are independent. The $\mu_i^j$ are viewed in a nonparametric
setting.

In the next theorem, we prove identifiability of the model's parameters
(that is, that $\mathbb{P}$ uniquely determines the factors appearing in (4)) under a
mild and explicit regularity condition on $\mathbb{P}$, as soon as there are at least 3
variates and $r$ is known. Making a judicious use of cut points to discretize
the distribution, and then using Kruskal's work, we prove the following.

Theorem 8. Let $\mathbb{P}$ be a mixture of the form (4), such that for every
$j \in \{1, \dots, p\}$, the measures $\{\mu_i^j\}_{1 \le i \le r}$ are linearly independent. Then, if
$p \ge 3$, the parameters $\{\pi_i, \mu_i^j\}_{1 \le i \le r,\, 1 \le j \le p}$ are strictly identifiable from $\mathbb{P}$, up
to label swapping.

This result also generalizes to nonparametric mixture models where at
least three blocks of variates are independent conditioned on the latent
structure. Let $b_1, \dots, b_p$ be integers with $p \ge 3$ the number of blocks, and the
$\mu_i^j$ be absolutely continuous probability measures on $\mathbb{R}^{b_j}$. With $m = \sum_j b_j$,
consider the mixture distribution on $\mathbb{R}^m$ given by

$$\mathbb{P} = \sum_{i=1}^{r} \pi_i \prod_{j=1}^{p} \mu_i^j. \tag{5}$$

Theorem 9. Let $\mathbb{P}$ be a mixture of the form (5), such that for every
$j \in \{1, \dots, p\}$, the measures $\{\mu_i^j\}_{1 \le i \le r}$ on $\mathbb{R}^{b_j}$ are linearly independent. Then,
if $p \ge 3$, the parameters $\{\pi_i, \mu_i^j\}_{1 \le i \le r,\, 1 \le j \le p}$ are strictly identifiable from $\mathbb{P}$,
up to label swapping.

Both Theorems 8 and 9 could be strengthened somewhat, as their proofs
do not depend on the full power of Kruskal's theorem. As an analog of
Kruskal rank for a matrix, say a finite set of measures has Kruskal rank
$k$ if $k$ is the maximal integer such that every $k$-element subset is linearly
independent. Then, for instance, when $p = 3$, straightforward modifications
of the proofs establish identifiability provided the sum of the Kruskal ranks
of the sets $\{\mu_i^j\}_{1 \le i \le r}$ for $j = 1, 2, 3$ is at least $2r + 2$.

Note that an earlier result linking identifiability with linear independence 
of the densities to be mixed appears in [43] in a parametric context. Com- 
bining this statement with the one obtained by Teicher [42], we get that 
in the parametric framework, a sufficient and necessary condition for strict 
identifiability of finite mixtures of product distributions is the linear inde- 
pendence of the univariate components (this statement does not require the 
knowledge of r). In this sense, our result may be seen as a generalization of 
these statements in the nonparametric context. 

Our results should be compared to two previous ones. First, Hettmansperger
and Thomas [26] studied the identical marginals case, where for each
$1 \le i \le r$, we have $\mu_i^j = \mu_i^k$ for all $1 \le j, k \le p$ (with corresponding c.d.f.
$F_i$). They proved that, as soon as $p \ge 2r - 1$, and there exists some $c \in \mathbb{R}$
such that the $\{F_i(c)\}_{1 \le i \le r}$ are all different, the mixing proportions $\pi_i$
are identifiable. Although they do not state it (because they are primarily
interested in estimation procedures and not identifiability), they also identify
the c.d.f.s $F_i(c)$, $1 \le i \le r$, at any point $c \in \mathbb{R}$ such that the $\{F_i(c)\}_{1 \le i \le r}$ are
all different.

Second, Hall and Zhou [25] proved that for $r = 2$, if $p \ge 3$ and the mixture
$\mathbb{P}$ is such that its two-dimensional marginals are not the product of the
corresponding one-dimensional ones, that is, for any $1 \le j < k \le p$ we
have

$$\mathbb{P}(X_j, X_k) \ne \mathbb{P}(X_j)\,\mathbb{P}(X_k), \tag{6}$$

then the parameters $\pi$ and $F_i^j$ are uniquely identified by $\mathbb{P}$. They also provide
consistent estimation procedures for the mixing proportions as well as for
the univariate c.d.f.'s $\{F_i^j\}_{1 \le j \le p,\, i = 1, 2}$.

Hall and Zhou state that their results "apparently do not have
straightforward generalizations to mixtures of three or more products of independent
components." In fact, Theorem 8 can already be viewed as such a
generalization, since in the $r = 2$ case we will show that inequality (6) is in fact
equivalent to the linear independence for each $j$ of the set $\{\mu_i^j\}_{1 \le i \le 2}$.

To develop a more direct generalization of the condition of Hall and Zhou, 
we say a bivariate continuous probability distribution is of rank r if it can 
be written as a sum of r products of signed univariate distributions (not 
necessarily probability distributions), but no fewer. We emphasize that we 
allow the univariate distributions to have negative values, even though the 
bivariate does not. The related notion of nonnegative rank, which addition- 
ally requires that the univariate distributions be nonnegative, will not play 
a role here. 

This definition of rank of a bivariate distribution is a direct generalization
of the notion of the rank for a matrix, with the bivariate distribution
replacing the matrix, the univariate ones replacing vectors, and the product of
distributions replacing the vector outer product. Moreover, in the case $r = 2$,
inequality (6) is equivalent to saying $\mathbb{P}(X_j, X_k)$ has rank 2, since its rank is
at most 2 from the expression (4), and if it had rank 1 then marginalizing
would show $\mathbb{P}(X_j, X_k) = \mathbb{P}(X_j)\,\mathbb{P}(X_k)$.

Next we connect this concept to the hypotheses of Theorem 8. 

Lemma 10. Consider a bivariate distribution of the form

$$\mathbb{P}(X_1, X_2) = \sum_{i=1}^{r} \pi_i\, \mu_i^1(X_1)\, \mu_i^2(X_2).$$

Then $\mathbb{P}(X_1, X_2)$ has rank $r$ if, and only if, for each of $j = 1, 2$ the measures
$\{\mu_i^j\}_{1 \le i \le r}$ are linearly independent.

From Theorem 8, this immediately yields the following. 

Corollary 11. Let $\mathbb{P}$ be a mixture of the form (4), and suppose that
for every $j \in \{1, \dots, p\}$, there is some $k \in \{1, \dots, p\}$ such that the marginal
$\mathbb{P}(X_j, X_k)$ is of rank $r$. Then, if $p \ge 3$, the parameters $\{\pi_i, \mu_i^j\}_{1 \le i \le r,\, 1 \le j \le p}$
are strictly identifiable from $\mathbb{P}$, up to label swapping.

Note that if the distribution $\mathbb{P}$ arising from (4) could be written with
strictly fewer than $r$ different product distributions, then the assumption of
Corollary 11 (and of Theorem 8) would not be met. Here, we state that a
slightly stronger condition, namely that any two-variate marginal of $\mathbb{P}$ cannot
be written as the sum of strictly fewer than $r$ product components, suffices
to ensure identifiability of the parameters. Note also that the condition
appearing in Theorem 8 is stated in terms of the parameters of the
distribution $\mathbb{P}$, which are unknown to the statistician. However, the rephrased
assumption appearing in Corollary 11 is stated in terms of the observation
distribution. Thus one could imagine testing this assumption on the data
(even if this might be a tough issue) prior to statistical inference.

The restriction to $p \ge 3$, which arises in our method from using Kruskal's
theorem, is necessary in this context. Indeed, in the case of $r = 2$ groups
and $p = 2$ variates, [25] proved that there exists a two-parameter continuum
of points $(\pi, \{\mu_i^j\}_{1 \le i \le 2,\, 1 \le j \le 2})$ solving equation (4). This simply echoes the
nonidentifiability in the case of 2 variates with finite state spaces commented
on in [30].

For models with more than 2 components in the mixture, Benaglia,
Chauveau and Hunter [4] recently proposed an algorithmic estimation procedure,
without assurance that the model would be identifiable. Our results state
that under mild regularity conditions, it is possible to identify the
parameters from the mixture $\mathbb{P}$, at least when the number of components $r$ is
fixed. Thus our approach gives some theoretical support to the procedure
developed in [4].

Finally, recall that in [14, 24], an upper bound on the number of variates
needed to ensure "generic" identifiability of the model is claimed which is
of the order $r \log_2(r)$. Our Theorem 8 lowers this bound considerably, as it
shows 3 variates suffice to identify the model, regardless of the value of $r$
(under a mild regularity assumption).

8. Proofs. 

Proof of Corollary 2. For each $j = 1, 2, 3$, let $M_j$ be the matrix
of size $r \times \kappa_j$ describing the probability distribution of $X_j$ conditional on
$Z$. More precisely, the $i$th row of $M_j$ is $p_{ij} = \mathbb{P}(X_j = \cdot \mid Z = i)$, for any
$i \in \{1, \dots, r\}$. Let $\tilde M_1$ be the matrix of size $r \times \kappa_1$ such that its $i$th row is
$\pi_i p_{i1} = \pi_i \mathbb{P}(X_1 = \cdot \mid Z = i)$. Kruskal ranks of $M_j$ and $\tilde M_1$ are denoted $I_j$ and
$\tilde I_1$, respectively. We have already seen that the tensor $[\tilde M_1, M_2, M_3]$ describes
the probability distribution of the observations $(X_1, X_2, X_3)$. Kruskal's result
states that, as soon as Kruskal ranks satisfy the condition $\tilde I_1 + I_2 + I_3 \ge
2r + 2$, this probability distribution uniquely determines the matrices $\tilde M_1$, $M_2$
and $M_3$ up to rescaling of the rows and label swapping. Note that as $\pi$
has positive entries, the Kruskal rank $\tilde I_1$ is equal to $I_1$. Moreover, using that
the matrices $M_1$, $M_2$ and $M_3$ are stochastic and that the entries of $\pi$ are
positive, the corollary follows. □

Proof of Corollary 3. We first show that for any fixed choice of
a positive integer $I_j \le \min(r, \kappa_j)$, those $r \times \kappa_j$ matrices $M_j$ whose Kruskal
rank is strictly less than $I_j$ form a proper algebraic variety. But the matrices
for which a specific set of $I_j$ rows is dependent form the zero set of all $I_j \times I_j$
minors obtained from those rows. By taking appropriate products of these
minors for all such sets of $I_j$ rows we may construct a set of polynomials
whose zero set is precisely those matrices of Kruskal rank less than $I_j$. This
variety is proper, since matrices of full rank do not lie in it.

Thus the set of triples of matrices $(M_1, M_2, M_3)$ for which the Kruskal
rank of some $M_j$ is strictly less than $\min(r, \kappa_j)$ forms a proper subvariety. For
triples not in this subvariety, our assumptions ensure that the rank inequality
of Corollary 3 holds, so the inequality holds generically. If the $\pi_i$'s are
fixed and positive, the proof is complete. Otherwise, note that the set of
parameters with vectors $\pi$ admitting zero entries is also a proper subvariety
of the parameter set. □

Proof of Theorem 4. Our goal is to apply Kruskal's result to models 
with more than 3 observed variables by means of a "grouping" argument. 
We require a series of lemmas to accomplish this. 

First, given an $n \times a_1$ matrix $A_1$ and an $n \times a_2$ matrix $A_2$, define the
$n \times a_1 a_2$ matrix $A = A_1 \otimes^{\mathrm{row}} A_2$ as the row-wise tensor product, so that

$$A(i, a_2(j - 1) + k) = A_1(i, j)\, A_2(i, k).$$

The proof of the following lemma is straightforward and therefore omitted. 

Lemma 12. If conditional on a finite random variable $Z$, the random
variables $X_1, X_2$ are independent, with the distribution of $X_i$ conditional
on $Z$ given by the matrix $A_i$ of size $r \times a_i$, then the row tensor product
$A = A_1 \otimes^{\mathrm{row}} A_2$ of size $r \times (a_1 a_2)$ contains the probability distribution of
$(X_1, X_2)$ conditional on $Z$.
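
A minimal sketch of the row-wise tensor product follows, using 0-based indices so that column $a_2 j + k$ of the result holds $A_1(i, j) A_2(i, k)$. The function name `row_tensor` and the example matrices are illustrative.

```python
import numpy as np


def row_tensor(A1, A2):
    """Row-wise tensor product: row i of the result is the Kronecker
    product of row i of A1 with row i of A2 (0-based column a2 * j + k)."""
    n, a1 = A1.shape
    n2, a2 = A2.shape
    if n != n2:
        raise ValueError("A1 and A2 must have the same number of rows")
    return np.einsum("ij,ik->ijk", A1, A2).reshape(n, a1 * a2)


# Conditional distributions of X1 (3 states) and X2 (2 states) given Z.
A1 = np.array([[0.2, 0.3, 0.5], [0.6, 0.3, 0.1]])
A2 = np.array([[0.9, 0.1], [0.4, 0.6]])
A = row_tensor(A1, A2)
```

Each row of `A` is again a probability vector, namely the joint distribution of $(X_1, X_2)$ given the corresponding value of $Z$.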

For each $j \in \{1, \dots, p\}$, denote by $M_j$ the $r \times k_j$ matrix whose $i$th row
is $\mathbb{P}(X_j = \cdot \mid Z = i)$. Introduce three matrices $N_i$, $i = 1, 2, 3$, of size $r \times \kappa_i$,
defined as

$$N_i = \bigotimes^{\mathrm{row}}_{j \in S_i} M_j,$$

and the tensor $N = [\tilde N_1, N_2, N_3]$, where the $i$th row of $\tilde N_1$ is $\pi_i$ times the
$i$th row of $N_1$. According to Lemma 12, the tensor $N$ contains the
probabilities of the three clumped variables $(\{X_j\}_{j \in S_1}, \{X_j\}_{j \in S_2}, \{X_j\}_{j \in S_3})$. Thus
knowledge of the distribution of the observations is equivalent to knowledge
of $N$. Moreover, for parameters $\pi$ having positive entries (which is a generic
condition), the Kruskal ranks of $\tilde N_1$ and $N_1$ are equal.

In the next lemma we characterize the Kruskal rank of the row-tensor
product obtained from generic matrices $A_i$.

Lemma 13. Let $A_i$, $i = 1, \dots, q$, denote $r \times a_i$ matrices, $a = \prod_{i=1}^{q} a_i$ and

$$A = \bigotimes^{\mathrm{row}}_{i = 1, \dots, q} A_i,$$

the $r \times a$ matrix obtained by taking tensor products of the corresponding rows
of the $A_i$. Then for generic $A_i$'s,

$$\operatorname{rank}_K A = \operatorname{rank} A = \min(r, a).$$



Proof. The condition that a matrix $A$ not have full rank (resp. full
Kruskal rank) is equivalent to the simultaneous vanishing of its maximal
minors (resp. idem when $r \le a$, and equivalent to the existence of one
vanishing maximal minor when $r > a$). Composing the map sending $\{A_i\} \mapsto A$
with these minors gives polynomials in the entries of the $A_i$. To see that the
polynomials in the entries of the $A_i$ are nonzero, it is enough to exhibit a
single choice of the $A_i$ for which $A$ has full rank (resp. full Kruskal rank).

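Let $x_{ij}$, $i = 1, \dots, q$, $j = 1, \dots, a_i$, be distinct prime numbers. Consider $A_i$
defined by

$$A_i = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_{i1} & x_{i2} & \cdots & x_{i a_i} \\ x_{i1}^2 & x_{i2}^2 & \cdots & x_{i a_i}^2 \\ \vdots & \vdots & & \vdots \\ x_{i1}^{r-1} & x_{i2}^{r-1} & \cdots & x_{i a_i}^{r-1} \end{pmatrix}.$$

For any vector $y \in \mathbb{C}^t$, let $W(y) = W(y_1, y_2, \dots, y_t)$ denote the $t \times t$
Vandermonde matrix, with entries $y_j^{i-1}$.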

Suppose first that $a \ge r$. Then the rows of $A$ are the first $r$ rows of the
Vandermonde matrix $W(y)$, where $y$ is a vector whose entries are $\prod_i x_{i j_i}$
for choices of $1 \le j_i \le a_i$. As the products $\prod_i x_{i j_i}$ are distinct by choice of
the $x_{ij}$, $W(y)$ is nonsingular, so $A$ has rank and Kruskal rank equal to $r$.

If instead $r > a$, then the first $a$ rows of $A$ form an invertible
Vandermonde matrix. Thus $A$ is of rank $a$. To argue that $A$ has full Kruskal rank,
compose the map $\{A_i\} \mapsto A$ with the $a \times a$ minor from the first $a$ rows of
$A$. This gives us a polynomial in the entries of the $A_i$, the nonvanishing
of which ensures the first $a$ rows of $A$ are independent. This polynomial is
not identically zero, since a specific choice of the $A_i$'s such that the first $a$
rows of $A$ are independent has been given. By composing this polynomial
with maps that permute rows of all the $A_i$ simultaneously, we may construct
nonzero polynomials whose nonvanishing ensures all other sets of $a$ rows of
$A$ are independent. The proper subvariety defined by the product of these
polynomials, then, is precisely those choices of $\{A_i\}$ for which $A$ is not of
full Kruskal rank. This concludes the proof of the lemma. □
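
The Vandermonde construction in this proof can be checked on a small instance. The sketch below takes $q = 2$, $r = 3$ and the primes $2, 3$ and $5, 7$ (an illustrative choice, covering only the case $a \ge r$); the helper names are hypothetical, and the Kruskal rank is computed by brute force.

```python
from functools import reduce
from itertools import combinations, product

import numpy as np


def row_tensor(A1, A2):
    """Row-wise tensor product of two matrices with the same row count."""
    n = A1.shape[0]
    return np.einsum("ij,ik->ijk", A1, A2).reshape(n, -1)


def prime_power_matrix(primes, r):
    """r x len(primes) matrix whose row m (m = 0..r-1) holds m-th powers."""
    x = np.asarray(primes, dtype=float)
    return np.vstack([x ** m for m in range(r)])


def kruskal_rank(M):
    """Brute-force row-version Kruskal rank."""
    best = 0
    for size in range(1, M.shape[0] + 1):
        if all(np.linalg.matrix_rank(M[list(rows)]) == size
               for rows in combinations(range(M.shape[0]), size)):
            best = size
        else:
            break
    return best


r = 3
A = reduce(row_tensor, [prime_power_matrix([2, 3], r),
                        prime_power_matrix([5, 7], r)])
# The products 10, 14, 15, 21 are distinct, so A consists of the first
# r rows of the Vandermonde matrix built on them.
y = np.array([p * q for p, q in product([2, 3], [5, 7])], dtype=float)
W_rows = np.vstack([y ** m for m in range(r)])
```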

Returning to the proof of Theorem 4, note that to apply the preceding
lemma to stochastic matrices $M_j$, we must address the fact that each row of
each $M_j$ sums to 1. However, as both rank and Kruskal rank are unaffected
by multiplying rows by nonzero scalars, and row sums being nonzero is a
generic condition (defined by the nonvanishing of linear polynomials), we
see immediately that the conclusion of Lemma 13 holds when all the $M_j$ are
additionally assumed to have row sums equal to 1.

We thus see that for generic $M_j$, the matrices $N_i$ defined above have
Kruskal rank $I_i = \min(r, \kappa_i)$. Now by assumption, the matrices $N_i$ satisfy
the condition of Corollary 2. This implies that the tensor $N = [\tilde N_1, N_2, N_3]$
uniquely determines the matrices $N_i$ and the vector $\pi$, up to permutation
of the rows. We need a last lemma before completing the proof of Theorem
4.

Lemma 14. Suppose $A = \bigotimes^{\mathrm{row}}_{i = 1, \dots, q} A_i$ where the $A_i$ are stochastic
matrices. Then the $A_i$ are uniquely determined by $A$.

Proof. Since each row of each Ai sums to 1, one easily sees that each 
entry in Ai can be recovered as a sum of certain entries in the same row of 
A. □ 
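
The marginalization argument of this proof can be sketched as follows: reshaping each row of $A$ into a $q$-way array and summing over all but one index recovers the corresponding factor. The function names are illustrative, and the column sizes $a_i$ are assumed known.

```python
from functools import reduce

import numpy as np


def row_tensor(A1, A2):
    """Row-wise tensor product of two matrices with the same row count."""
    n = A1.shape[0]
    return np.einsum("ij,ik->ijk", A1, A2).reshape(n, -1)


def recover_factors(A, sizes):
    """Recover stochastic factors A_1..A_q (with a_i = sizes[i] columns)
    from their row tensor product A by summing out the other indices."""
    r = A.shape[0]
    T = A.reshape((r,) + tuple(sizes))
    factors = []
    for axis in range(len(sizes)):
        others = tuple(ax + 1 for ax in range(len(sizes)) if ax != axis)
        factors.append(T.sum(axis=others))
    return factors


rng = np.random.default_rng(1)
sizes = (2, 3, 2)
# Random stochastic matrices with r = 4 rows (each row sums to 1).
originals = [rng.dirichlet(np.ones(a), size=4) for a in sizes]
product_matrix = reduce(row_tensor, originals)
recovered = recover_factors(product_matrix, sizes)
```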

Using this lemma, we have that each $N_i$ uniquely determines the matrices
$M_j$ for $j \in S_i$, and Theorem 4 follows. □

Proof of Corollary 5. It is enough to consider the case where $p =
2\lceil \log_2 r \rceil + 1$. With $k = \lceil \log_2 r \rceil$, we have that $2^{k-1} < r \le 2^k$. Choosing

$$\kappa_1 = \kappa_2 = 2^k, \qquad \kappa_3 = 2,$$

inequality (2) in Theorem 4 holds. □

Proof of Theorem 6. The $2k + 1$ consecutive observed variables can
be taken to be $X_0, X_1, \dots, X_{2k}$. Note that the transition matrix from $Z_i$ to
$Z_{i-1}$ is given by $A' = \operatorname{diag}(\pi)^{-1} A^T \operatorname{diag}(\pi)$.

Let $B_1$ be the $r \times \kappa^k$ matrix giving probabilities of joint states of $X_0, X_1, \dots,
X_{k-1}$ conditioned on the states of $Z_k$. Similarly, let $B_2$ be the $r \times \kappa^k$ matrix
giving probabilities of joint states of $X_{k+1}, \dots, X_{2k}$ conditioned on the states
of $Z_k$.

Now the joint distribution for the model $\mathcal{M}(r; \kappa^k, \kappa^k, \kappa)$ with parameters
$\pi, B_1, B_2, B$ is the same as that of the HMM with parameters $\pi, A, B$. Thus
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we apply Kruskal's theorem, after we first show iv,Bi,B 2 are sufficiently 
generic to do so for generic choices of A, B. The entries of 7r have been 
assumed to be positive. With Im denoting the Kruskal rank of a matrix M, 
in order to apply Corollary 2, we want to ensure 

I Bl + Ib 2 +I B >2r + 2. 

Making the generic assumption that B has Kruskal rank at least 2, it is 
sufficient to make 

iBi , Ib 2 > r, 

that is, to require that B\,B 2 have full row rank. 
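
For small matrices the Kruskal rank condition can be checked by brute force. The sketch below assumes the row-based convention used in this proof (the Kruskal rank is the largest $I$ such that every set of $I$ rows is independent); the helper names and the example matrices are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def rank(rows):
    """Exact rank via Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in row] for row in rows]
    rk = 0
    for c in range(len(m[0]) if m else 0):
        pivot = next((i for i in range(rk, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[rk], m[pivot] = m[pivot], m[rk]
        for i in range(rk + 1, len(m)):
            f = m[i][c] / m[rk][c]
            m[i] = [a - f * b for a, b in zip(m[i], m[rk])]
        rk += 1
        if rk == len(m):
            break
    return rk


def kruskal_rank(m):
    """Largest I such that every set of I rows of m is linearly independent."""
    best = 0
    for size in range(1, len(m) + 1):
        subsets = combinations(range(len(m)), size)
        if all(rank([m[i] for i in s]) == size for s in subsets):
            best = size
        else:
            break
    return best


r = 3
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]           # stands in for B_1 and B_2
B = [[Fraction(1, 2), Fraction(1, 2)],
     [Fraction(1, 3), Fraction(2, 3)],
     [Fraction(1, 4), Fraction(3, 4)]]                 # r x kappa with kappa = 2
total = 2 * kruskal_rank(identity) + kruskal_rank(B)
print(kruskal_rank(B), total >= 2 * r + 2)
```

Here $B$ has only two columns, so its Kruskal rank cannot exceed 2, and the condition is met exactly because the two other matrices have full row rank.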
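Now $B_1, B_2$ can be explicitly given as

$$
\begin{aligned}
B_1 &= A'(B \otimes^{\mathrm{row}} (\cdots A'(B \otimes^{\mathrm{row}} (A'(B \otimes^{\mathrm{row}} (A'B)))) \cdots)), \\
B_2 &= A(B \otimes^{\mathrm{row}} (\cdots A(B \otimes^{\mathrm{row}} (A(B \otimes^{\mathrm{row}} (AB)))) \cdots))
\end{aligned}
\tag{7}
$$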

with $k$ copies of $A'$ and of $B$ appearing in the expression for $B_1$, and $k$ copies of $A$ and $B$ appearing in that for $B_2$. To show these have full row rank for generic choices of stochastic $A$ and $B$, it is enough to show they have full row rank for some specific choice of stochastic $A, B, \pi$, since that will establish that some $r \times r$ minors of $B_1, B_2$ are nonzero polynomials in the entries of $A', B$ and $A, B$, respectively. For this argument, we may even allow our choice of $A$ to lie outside of those usually allowed in the statistical model, as long as it lies in their (Zariski) closure. We therefore choose $A$ to be the identity, and $\pi$ arbitrarily, so that $A'$ is also the identity, thus simplifying to considering

$$B_1 = B_2 = B \otimes^{\mathrm{row}} B \otimes^{\mathrm{row}} \cdots \otimes^{\mathrm{row}} B \quad (k \text{ factors}). \tag{8}$$
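
The nested expression for $B_2$ in (7) can be compared against a direct enumeration of hidden paths on a toy HMM. This is only a sketch: the matrices `A` and `B`, the helper names, and the column ordering (first emitted symbol most significant) are assumptions made for the illustration.

```python
from fractions import Fraction as F
from itertools import product


def matmul(a, b):
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def row_tensor(a, b):
    """Row-wise tensor product of two matrices with the same number of rows."""
    return [[x * y for x in ra for y in rb] for ra, rb in zip(a, b)]


def b2_recursive(a, b, k):
    """B_2 = A(B (x)row (... A(B (x)row (AB)) ...)) with k copies of A and B."""
    m = matmul(a, b)                      # innermost factor AB
    for _ in range(k - 1):
        m = matmul(a, row_tensor(b, m))
    return m


def b2_bruteforce(a, b, k):
    """P(next k emissions | current hidden state), summing over hidden paths."""
    r, kappa = len(a), len(b[0])
    out = []
    for z0 in range(r):
        row = []
        for xs in product(range(kappa), repeat=k):
            total = F(0)
            for zs in product(range(r), repeat=k):
                prob, prev = F(1), z0
                for z, x in zip(zs, xs):
                    prob *= a[prev][z] * b[z][x]
                    prev = z
                total += prob
            row.append(total)
        out.append(row)
    return out


A = [[F(3, 4), F(1, 4)], [F(1, 3), F(2, 3)]]     # illustrative transition matrix
B = [[F(1, 2), F(1, 2)], [F(1, 5), F(4, 5)]]     # illustrative emission matrix
identity = [[F(1), F(0)], [F(0), F(1)]]

print(b2_recursive(A, B, 3) == b2_bruteforce(A, B, 3))
print(b2_recursive(identity, B, 2) == row_tensor(B, B))
```

The second comparison reproduces the simplification leading to (8): with $A$ the identity, the nested formula collapses to a row tensor power of $B$.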



It is now enough to show that $B_1$, as given in (8), has full row rank for some choice of stochastic $B$. We proceed very similarly to the proof of Lemma 13, but since a row tensor power occurs here rather than an arbitrary product, we must make some small changes to the argument.

Since nonzero rescalings of the rows of $B$ have no effect on the rank of $B_1$ in (8), we do not need to require that the row sums of $B$ are 1. So let $x = (x_1, x_2, \dots, x_\kappa)$ be a vector of distinct primes, and define $B$ by

$$
B = \begin{pmatrix}
1 & 1 & \cdots & 1 \\
x_1 & x_2 & \cdots & x_\kappa \\
x_1^2 & x_2^2 & \cdots & x_\kappa^2 \\
\vdots & \vdots & & \vdots \\
x_1^{r-1} & x_2^{r-1} & \cdots & x_\kappa^{r-1}
\end{pmatrix}.
$$

Let $y = x \otimes x \otimes \cdots \otimes x$ with $k$ factors. Then using notation from Lemma 13, $B_1$ will be the first $r$ rows of the Vandermonde matrix $W(y)$. To ensure $B_1$ has rank $r$, it is sufficient to ensure that $r$ of the entries of $y$ are distinct, since then $B_1$ has a nonsingular Vandermonde submatrix of size $r$. The number of distinct entries of $y$ is the number of distinct monomials of degree $k$ in the $x_i$, $1 \le i \le \kappa$. This number is $\binom{k+\kappa-1}{\kappa-1}$, so to ensure that generic $A, B$ lead to $B_1$ having full row rank, we ask that $k$ satisfy

$$\binom{k+\kappa-1}{\kappa-1} \ge r.$$

For fixed $\kappa$, the expression on the left of this inequality is an increasing and unbounded function of $k$, so this condition can be met for any $r, \kappa$.
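
The counting argument can be tried out on a small case. The sketch below assumes $\kappa \ge 2$ and builds $B_1$ directly as the first $r$ rows of the Vandermonde matrix on the entries of $y$; the function names `min_k` and `b1_for_primes` are hypothetical.

```python
from fractions import Fraction
from itertools import product
from math import comb, prod


def rank(rows):
    """Exact rank via Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in row] for row in rows]
    rk = 0
    for c in range(len(m[0]) if m else 0):
        pivot = next((i for i in range(rk, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[rk], m[pivot] = m[pivot], m[rk]
        for i in range(rk + 1, len(m)):
            f = m[i][c] / m[rk][c]
            m[i] = [a - f * b for a, b in zip(m[i], m[rk])]
        rk += 1
        if rk == len(m):
            break
    return rk


def min_k(r, kappa):
    """Smallest k with C(k + kappa - 1, kappa - 1) >= r (assumes kappa >= 2)."""
    if kappa < 2:
        raise ValueError("kappa must be at least 2")
    k = 1
    while comb(k + kappa - 1, kappa - 1) < r:
        k += 1
    return k


def b1_for_primes(x, r, k):
    """Rows 0..r-1 of the Vandermonde matrix on y = x (x) ... (x) x (k factors)."""
    y = [prod(t) for t in product(x, repeat=k)]
    return [[Fraction(v) ** i for v in y] for i in range(r)]


r, x = 5, (2, 3)                       # kappa = 2 observable states
k = min_k(r, len(x))
print(k, rank(b1_for_primes(x, r, k)), rank(b1_for_primes(x, r, k - 1)))
```

With one fewer factor, $y$ has only four distinct entries, so the matrix cannot reach rank 5; this is the role the binomial inequality plays in the argument.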

Thus by Kruskal's theorem, from the joint distribution of $2k + 1$ consecutive variables of the HMM for generic $A, B$, we may determine $PB_1, PB_2, PB$, where $P$ is an unknown permutation.

Now to identify $A, B$ up to label swapping means to determine $\tilde A = PAP^T$ and $\tilde B = PB$ for some permutation $P$. As $\tilde B$ has been found, we focus on $\tilde A$. From (7) one finds

$$PB_2 = \tilde A(\tilde B \otimes^{\mathrm{row}} (\cdots \tilde A(\tilde B \otimes^{\mathrm{row}} (\tilde A(\tilde B \otimes^{\mathrm{row}} (\tilde A \tilde B)))) \cdots)).$$

In this expression, $\tilde A$ and $\tilde B$ appear $k$ times. Since each row of $\tilde B$ sums to 1, by appropriate summing of the columns of this matrix (marginalizing over the variable $X_{2k}$), we may determine a matrix $M$ given by a similar formula, but with only $k - 1$ occurrences of $\tilde A$ and $\tilde B$. Then

$$PB_2 = \tilde A(\tilde B \otimes^{\mathrm{row}} M).$$

As $PB_2$ and $\tilde B \otimes^{\mathrm{row}} M$ are known and generically of rank $r$, from this equation one can identify the matrix $\tilde A$.

Thus the HMM parameters $A$ and $B$ are identifiable up to permutation of the states of the hidden variables. □

Proof of Theorem 7. For the $n$ node model, with node set $V_n = \{v_k\}_{1 \le k \le n}$, denote (undirected) edges in the complete graph $K_n$ on $V_n$ by $(v_k, v_l) = (v_l, v_k)$ for $k \ne l$. We assume $0 < \pi_1 \le \pi_2 < 1$, $p_{ij} \in [0, 1]$ and $p_{11}, p_{12}, p_{22}$ are distinct.

Let $Z = (Z_1, Z_2, \dots, Z_n)$ be the random variable, with state space $\{1, 2\}^n$, which describes the state of all nodes collectively. Ordering the elements of the state space in some way, we find the probabilities of the various states of $Z$ are given by the entries of a vector $v \in \mathbb{R}^{2^n}$, all of which have the form $\pi_1^k \pi_2^{n-k}$. Observe for later use that no entries of $v$ are zero, and the smallest and largest entries are $\pi_1^n$ and $\pi_2^n$, respectively (if $\pi_1 = \pi_2 = 1/2$, then all the entries of $v$ are equal to $2^{-n}$).

Elements of the state space of $Z$ are specified by $\mathcal{I} = (i_1, i_2, \dots, i_n) \in \{1, 2\}^n$, meaning $i_k$ is the state of $v_k$. An assignment of states to all edges in a graph $G \subseteq K_n$ will be represented by a subgraph $\mathcal{G} \subseteq G$ containing only those edges in state 1, in accord with the interpretation of edge states 0 and 1 as "absent" and "present." We refer to the probability of a particular state assignment to the edges of $G$ as the probability of observing the corresponding $\mathcal{G}$. We may think of any such $G$ as specifying a composite edge variable, whose states are represented by the subgraphs $\mathcal{G} \subseteq G$.

To relate the random graph model to the model of Kruskal's theorem, we must choose three observed variables and one hidden variable that reflect a conditional independence structure. The hidden variable will be $Z$ described above, indicating the state of some number of nodes $n$, to be chosen below. The observed variables will correspond to three pairwise edge-disjoint subgraphs $G_1, G_2, G_3$ of $K_n$. By choosing the $G_i$ to have no edges in common, we ensure that for $i \ne j$ observing any subgraph $\mathcal{G}_i$ of $G_i$ is independent of observing any subgraph $\mathcal{G}_j$ of $G_j$, conditioned on the state of $Z$. To meet the technical assumptions of Kruskal's theorem, we will also choose the $G_i$ so that the three matrices $B_i$ whose entries give probabilities of observing each subgraph $\mathcal{G}_i$ of $G_i$, conditioned on the state of $Z$, have full row rank. These matrices thus give conditional probabilities of observations marginalized over all edges not in $G_i$.

The construction of the $G_i$ proceeds in several steps. We begin by considering a small complete graph, and an associated matrix: for a set of 4 nodes, define a $2^4 \times 2^{\binom{4}{2}} = 16 \times 64$ matrix $A$, with rows indexed by assignments $\mathcal{I} \in \{1,2\}^4$ of states to the nodes, columns indexed by all subgraphs $\mathcal{G}$ of $K_4$ and entries giving the probability of observing the subgraph conditioned on the state assignment of the nodes. Each entry of $A$ is thus a monomial in the $p_{ij}$ and $q_{ij} = 1 - p_{ij}$. Explicitly, if $\mathcal{I} = (i_1, i_2, i_3, i_4)$, and $e_{kl} \in \{0, 1\}$ is the state of edge $(v_k, v_l)$ in $\mathcal{G}$, the $(\mathcal{I}, \mathcal{G})$-entry of $A$ is

$$\prod_{1 \le k < l \le 4} p_{i_k i_l}^{e_{kl}} \, q_{i_k i_l}^{1 - e_{kl}}.$$

Lemma 15. For distinct $p_{11}, p_{12}, p_{22}$, the $16 \times 64$ matrix $A$ described above has full row rank.

This lemma can be established by a rank computation with symbolic algebra software, so we omit a proof. One can also see, either through computation or reasoning, that the complete graph on fewer than 4 nodes fails to produce a matrix of full rank.
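
A numerical spot check of Lemma 15 is easy to set up, although it is not the symbolic computation the text refers to. The sketch below evaluates the matrix at one illustrative choice of distinct $p_{11}, p_{12}, p_{22}$ (the specific values and the helper names are assumptions) and computes exact ranks over the rationals for $K_4$ and $K_3$.

```python
from fractions import Fraction
from itertools import combinations, product


def rank(rows):
    """Exact rank via Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in row] for row in rows]
    rk = 0
    for c in range(len(m[0]) if m else 0):
        pivot = next((i for i in range(rk, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[rk], m[pivot] = m[pivot], m[rk]
        for i in range(rk + 1, len(m)):
            f = m[i][c] / m[rk][c]
            m[i] = [a - f * b for a, b in zip(m[i], m[rk])]
        rk += 1
        if rk == len(m):
            break
    return rk


def subgraph_matrix(m, p):
    """Rows: node-state assignments in {1,2}^m; columns: subgraphs of K_m."""
    edges = list(combinations(range(m), 2))
    matrix = []
    for states in product((1, 2), repeat=m):
        row = []
        for present in product((0, 1), repeat=len(edges)):
            prob = Fraction(1)
            for (k, l), e in zip(edges, present):
                pe = p[tuple(sorted((states[k], states[l])))]
                prob *= pe if e else 1 - pe
            row.append(prob)
        matrix.append(row)
    return matrix


# One illustrative choice of distinct connection probabilities.
p = {(1, 1): Fraction(1, 2), (1, 2): Fraction(1, 5), (2, 2): Fraction(1, 3)}
A4 = subgraph_matrix(4, p)
A3 = subgraph_matrix(3, p)
print(len(A4), len(A4[0]), rank(A4))   # Lemma 15 predicts full row rank 16
print(len(A3), len(A3[0]), rank(A3))   # fewer than 4 nodes: rank falls short of 8
```

A single evaluation only illustrates the lemma at one parameter point; the statement for all distinct values still rests on the symbolic argument.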

The next lemma shows we can find the 3 edge-disjoint subgraphs needed for the application of Kruskal's theorem. As the rest of the proof does not depend on the nodes having 2 states, we state the following lemma for an arbitrary number of node states. The more general formulation we prove here will be needed in a subsequent paper.

Let $r$ denote the number of node states and suppose we have found a number $m$ such that the $r^m \times 2^{\binom{m}{2}}$ matrix $A$ of probabilities of observations of subgraphs of the complete graph on $m$ nodes conditioned on node states has rank $r^m$. Lemma 15 establishes that for $r = 2$, we may take $m = 4$.

Lemma 16. Suppose for the $r$-node-state model, the number of nodes $m$ is such that the $r^m \times 2^{\binom{m}{2}}$ matrix $A$ of probabilities of observing subgraphs of $K_m$ conditioned on node state assignments has rank $r^m$. Then with $n = m^2$ there exist pairwise edge-disjoint subgraphs $G_1, G_2, G_3$ of $K_n$ such that for each $G_i$, the matrix $B_i$ of probabilities of observing subgraphs of $G_i$ conditioned on node state assignments has rank $r^n$.

Proof. We first describe the construction of the subgraphs $G_1, G_2, G_3$ of $K_n$. For each $G_i$, we partition the $m^2$ nodes into $m$ groups of size $m$ in a way to be described shortly. Then $G_i$ will be the union of the $m$ complete graphs on each partition set. Thus $G_i$ has $m\binom{m}{2}$ edges.

For conditional independence of observations of edges in $G_i$, from those in $G_j$ with $i \ne j$, we must ensure $G_i$ and $G_j$ have no edges in common. This requires only that a partition set of nodes leading to $G_i$ has at most one element in common with a partition set leading to $G_j$, if $i \ne j$. Labeling the nodes by $(i, j) \in \{1, \dots, m\} \times \{1, \dots, m\}$, we picture the nodes as lattice points in a square grid. We take as the partition leading to $G_1$ the rows of the grid, as the partition leading to $G_2$ the columns of the grid and as the partition leading to $G_3$ the diagonals. Explicitly, if $V^i = \{V^i_j \mid j \in \{1, \dots, m\}\}$ denotes the partition of the node set $V_{m^2}$ leading to $G_i$, then

$$
\begin{aligned}
V^1_j &= \{(j, i) \mid i \in \{1, \dots, m\}\}, \\
V^2_j &= \{(i, j) \mid i \in \{1, \dots, m\}\}, \\
V^3_j &= \{(i, i + j \bmod m) \mid i \in \{1, \dots, m\}\}
\end{aligned}
$$

and each $G_i$ is the union over $j \in \{1, \dots, m\}$ of the complete graphs on node set $V^i_j$.

Now $B_i$, the matrix of conditional probabilities of observing all possible subgraphs of $G_i$, conditioned on node states, has $r^n$ rows indexed by composite states of all $n = m^2$ nodes, and $2^{m\binom{m}{2}}$ columns indexed by subgraphs of $G_i$. Observe that with an appropriate ordering of the rows and columns (which is dependent on $i$), $B_i$ has a block structure given by

$$B_i = A \otimes A \otimes \cdots \otimes A \quad (m \text{ factors}). \tag{9}$$

[Note that since $A$ is $r^m \times 2^{\binom{m}{2}}$, the tensor product on the right is $(r^m)^m \times (2^{\binom{m}{2}})^m$, which is $r^{m^2} \times 2^{m\binom{m}{2}}$, the size of $B_i$.] That $B_i$ is this tensor product is most easily seen by noting the partitioning of the $m^2$ nodes into $m$ disjoint sets $V^i_j$ gives rise to $m$ copies of the matrix $A$, one for each complete graph on a $V^i_j$. The row indices of $B_i$ are obtained by choosing an assignment of states to the nodes in $V^i_j$ for each $j$ independently, and the column indices by the union of independently-chosen subgraphs of the complete graphs on $V^i_j$ for each $j$. This independence in both rows and columns leads to the tensor decomposition of $B_i$.

Now since $A$ has full row rank, (9) implies that $B_i$ does as well. □
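
The row, column and diagonal partitions used in this proof can be generated and checked for edge-disjointness directly. The sketch below uses 0-based node labels $(i, j)$ with $0 \le i, j < m$, which is an assumption made for convenience; the helper names are hypothetical.

```python
from itertools import combinations
from math import comb


def grid_partitions(m):
    """Rows, columns and (wrapped) diagonals of an m x m grid of nodes."""
    rows = [[(j, i) for i in range(m)] for j in range(m)]
    cols = [[(i, j) for i in range(m)] for j in range(m)]
    diags = [[(i, (i + j) % m) for i in range(m)] for j in range(m)]
    return rows, cols, diags


def edges_of(partition):
    """Edge set of the union of the complete graphs on each partition set."""
    return {frozenset(pair)
            for block in partition
            for pair in combinations(block, 2)}


m = 4
G1, G2, G3 = (edges_of(part) for part in grid_partitions(m))
print(len(G1), len(G2), len(G3))             # each should have m * C(m, 2) edges
print(G1 & G2 == set(), G1 & G3 == set(), G2 & G3 == set())
```

Two nodes in the same row differ in their column and lie on different diagonals, which is why no edge can appear in two of the three graphs.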

Remark 1. For future work, we note that this lemma easily generalizes to graph models in which edges may be in any of $s$ states, with $s > 2$. In that case, the matrix $A$ is $r^m \times s^{\binom{m}{2}}$, and the columns of $A$ are no longer indexed by subgraphs of $K_m$, but rather by $s$-colorings of the edges of $K_m$.

To complete the proof of Theorem 7, we apply Corollary 2 for $\mathcal{M}(2^{m^2}; 2^{m\binom{m}{2}}, 2^{m\binom{m}{2}}, 2^{m\binom{m}{2}})$ to the parameter choice $\pi = v$, $M_i = B_i$, to find $v$ and each $B_i$ is identifiable, up to row permutation. Thus here we do not apply the corollary to the full random graph model, but rather its marginalization over all edges not in $G_1 \cup G_2 \cup G_3$.

Suppose now that $\pi_1 \ne \pi_2$. Since $\pi_1^{m^2}, \pi_2^{m^2}$ are the smallest and largest entries of $v$, respectively, we may determine $\pi_1, \pi_2$, as well as which of the rows of $B_i$ correspond to having all nodes in state 1 or all in state 2. Summing appropriate entries of these rows, we obtain the probabilities $p_{11}$ and $p_{22}$ of observing a single edge conditioned on these node states. (This is simply a marginalization; sum the row entries corresponding to all subgraphs of $G_i$ that contain a fixed edge.) To find $p_{12}$, by consulting $v$ we may choose one of the $n$ rows of $B_1$ which corresponds to node states with all nodes but one in state 1. By considering sums of row entries to obtain the conditional probability of observing a single edge, we can produce the numbers $p_{11}, p_{12}$. As $p_{11}$ is known, and $p_{12}$ is distinct from it, we thus determine $p_{12}$.

If $\pi_1 = \pi_2$, then we cannot immediately determine which rows of $B_i$ correspond to all nodes being in state 1 or in state 2. However, by marginalizing all rows to obtain the conditional probability of observing a single edge, we may determine the set of numbers $\{p_{11}, p_{12}, p_{22}\}$. With these in hand, we may then determine which two rows correspond to having all nodes in state 1 and all in state 2. This then uniquely determines which of the numbers is $p_{12}$, so everything is known up to label swapping. □
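
The two recovery steps at the end of this proof, reading $\pi_1, \pi_2$ off the extreme entries of $v$ and marginalizing a row to a single edge, can be sketched on a toy case. The triangle $K_3$ stands in for a subgraph $G_i$ here, and all numerical values and helper names are assumptions made for the illustration.

```python
from fractions import Fraction
from itertools import combinations, product
from math import prod


def conditional_row(states, edges, p):
    """Probabilities of all subgraphs on `edges`, conditioned on the node states."""
    row = {}
    for present in product((0, 1), repeat=len(edges)):
        prob = Fraction(1)
        for (k, l), e in zip(edges, present):
            pe = p[tuple(sorted((states[k], states[l])))]
            prob *= pe if e else 1 - pe
        row[present] = prob
    return row


def edge_marginal(row, edge_index):
    """Sum the row entries over all subgraphs that contain the chosen edge."""
    return sum(prob for present, prob in row.items() if present[edge_index] == 1)


def recover_pi(v, n):
    """pi_1 and pi_2 from the smallest and largest entries of v (floating point)."""
    return min(v) ** (1.0 / n), max(v) ** (1.0 / n)


p = {(1, 1): Fraction(1, 2), (1, 2): Fraction(1, 5), (2, 2): Fraction(1, 3)}
edges = list(combinations(range(3), 2))            # toy stand-in for G_i
all_one = conditional_row((1, 1, 1), edges, p)
one_odd = conditional_row((1, 1, 2), edges, p)
marg_all_one = [edge_marginal(all_one, e) for e in range(len(edges))]
marg_one_odd = sorted({edge_marginal(one_odd, e) for e in range(len(edges))})

n = 4
v = [prod(states) for states in product((0.3, 0.7), repeat=n)]
print(marg_all_one, marg_one_odd, recover_pi(v, n))
```

The row with a single node in state 2 yields both $p_{11}$ and $p_{12}$ as edge marginals, and knowing $p_{11}$ from the all-state-1 row singles out $p_{12}$.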

Remark 2. While the above argument shows generic identifiability of the parameters of the 2-node state random graph model provided there are at least 16 nodes, a slightly more complicated argument, which we do not include here, can replace Lemma 16 to establish generic identifiability provided there are at least 10 nodes. Thus we make no claim to having determined the minimum number of nodes to ensure identifiability.

Proof of Theorem 8. We assume as usual that $Z$ is a latent random variable with distribution on $\{1, \dots, r\}$ given by the vector $\pi$, and $X = (X_1, \dots, X_p)$ is the vector of observations such that conditional on $Z = i$, the variates $\{X_j\}_{1 \le j \le p}$ are independent, each $X_j$ having probability distribution $\mu_i^j$. We focus on 3 random variables at a time only, beginning first with $X_1, X_2, X_3$. The idea is to construct a binning of the random variables $X_1, X_2$ and $X_3$ using $\kappa_j - 1 \in \mathbb{N}$ cut points for $X_j$. For each $j = 1, 2, 3$, consider a partition of $\mathbb{R}$ into $\kappa_j$ consecutive intervals $\{I_j^k\}_{1 \le k \le \kappa_j}$ and consider the random variable $Y_j = (1\{X_j \in I_j^1\}, \dots, 1\{X_j \in I_j^{\kappa_j}\})$, where $1\{A\}$ denotes the indicator function of set $A$. This is a finite random variable taking values in $\{0, 1\}^{\kappa_j}$ with at most one nonzero entry. We will show here that we can identify the proportions $\pi_i$ and the probability measures $\mu_i^j$, $1 \le i \le r$, $1 \le j \le 3$, relying only on the binned observed variables $\{Y_1, Y_2, Y_3\}$ for some well-chosen partitions of $\mathbb{R}$.

Consider for each $j = 1, 2, 3$, the matrices $M_j$ of size $r \times \kappa_j$ whose $i$th row is the distribution of $Y_j$ conditional on $Z = i$, namely the vector $[P(X_j \in I_j^1 \mid Z = i), \dots, P(X_j \in I_j^{\kappa_j} \mid Z = i)]$. Introduce the matrix $\tilde M_1$ whose $i$th row is the $i$th row of $M_1$ multiplied by the value $\pi_i$. Note that the $M_j$'s are stochastic matrices. Moreover, the tensor product $[\tilde M_1, M_2, M_3]$ is the $\kappa_1 \times \kappa_2 \times \kappa_3$ table whose $(k_1, k_2, k_3)$ entry is the probability $P((X_1, X_2, X_3) \in I_1^{k_1} \times I_2^{k_2} \times I_3^{k_3})$. This tensor is completely known as soon as the probability distribution (4) is given. Now we use Kruskal's result to prove that with knowledge of the tensor $[\tilde M_1, M_2, M_3]$, we can recover the parameters $\pi$ and the stochastic matrices $M_j$, $j = 1, 2, 3$. If we can do so for general enough and well-chosen partitions $\{I_j^k\}_{1 \le k \le \kappa_j}$, then we will be able to recover the measures $\mu_i^j$ for $j = 1, 2, 3$ and $1 \le i \le r$.

We look for partitions $\{I^k\}_{1 \le k \le \kappa}$ with $\kappa \ge r$, such that the corresponding matrix $M$ has full row rank. Here we deliberately dropped the index $j = 1, 2, 3$. If we can construct these matrices with full row rank, then we get $I_1 + I_2 + I_3 = 3r \ge 2r + 2$ and Kruskal's result applies. As the partition $\{I^k\}_{1 \le k \le \kappa}$ is composed of consecutive intervals, the rows of the matrix $M$ are of the form $[F_i(u_1), F_i(u_2) - F_i(u_1), \dots, F_i(u_{\kappa-1}) - F_i(u_{\kappa-2}), 1 - F_i(u_{\kappa-1})]$ for some real number cut points $u_1 < u_2 < \cdots < u_{\kappa-1}$. Replacing the $j$th column $C_j$ of $M$ by $C_j + C_{j-1}$ for consecutive $j$ from 2 to $\kappa$, we construct $M'$ with same rank as $M$ and whose $i$th row is $[F_i(u_1), F_i(u_2), \dots, F_i(u_{\kappa-1}), 1]$. Now linear independence of the probability distributions $\{\mu_i\}_{1 \le i \le r}$ is equivalent to linear independence of the c.d.f.s $\{F_i\}_{1 \le i \le r}$. We need the following lemma.

Lemma 17. Let $\{F_i\}_{1 \le i \le r}$ be linearly independent functions on $\mathbb{R}$. Then there exists some $\kappa \in \mathbb{N}$ and real numbers $u_1 < u_2 < \cdots < u_{\kappa-1}$ such that the vectors

$$\{(F_i(u_1), \dots, F_i(u_{\kappa-1}), 1)\}_{1 \le i \le r}$$

are linearly independent.

Proof. Let us consider a set of points $u_1 < u_2 < \cdots < u_m$ in $\mathbb{R}$ with $m \ge r$ and the matrix $A_m$ of size $r \times (m+1)$ whose $i$th row is $(F_i(u_1), \dots, F_i(u_m), 1)$. Denote by $\mathcal{N}_m = \{\alpha \in \mathbb{R}^r \mid \alpha A_m = 0\}$, the left nullspace of the matrix $A_m$, and let $d_m$ be its dimension. If $d_m = 0$, then the matrix $A_m$ has full row rank and the proof is complete. Now if $d_m \ge 1$, choose a nonzero vector $\alpha \in \mathcal{N}_m$. By linear independence of the $F_i$'s, we know that $\sum_{i=1}^r \alpha_i F_i$ is not the zero function, which means that there exists some $u_{m+1} \in \mathbb{R}$ such that $\sum_{i=1}^r \alpha_i F_i(u_{m+1}) \ne 0$. Up to a reordering of the $u$'s, we may assume $u_m < u_{m+1}$ and consider the matrix $A_{m+1}$ whose $i$th row is $[F_i(u_1), \dots, F_i(u_{m+1}), 1]$ and whose left nullspace $\mathcal{N}_{m+1}$ has dimension $d_{m+1} < d_m$. Indeed, we have $\mathcal{N}_{m+1} \subseteq \mathcal{N}_m$ and by construction, the one-dimensional space spanned by the vector $\alpha$ is not in $\mathcal{N}_{m+1}$. Repeating this construction a finite number of times, we find a matrix $A_\kappa$ with the desired properties. □
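
The construction in this proof suggests a simple greedy procedure. The sketch below is a simplified variant rather than the argument itself: instead of computing a null vector $\alpha$ and locating a point where $\sum_i \alpha_i F_i \ne 0$, it scans a finite candidate grid and keeps each point that raises the rank. The grid, the c.d.f.s $t \mapsto t^i$ on $[0,1]$ and the helper names are all assumptions; with too coarse a grid the procedure may stop short of full rank.

```python
from fractions import Fraction


def rank(rows):
    """Exact rank via Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in row] for row in rows]
    rk = 0
    for c in range(len(m[0]) if m else 0):
        pivot = next((i for i in range(rk, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[rk], m[pivot] = m[pivot], m[rk]
        for i in range(rk + 1, len(m)):
            f = m[i][c] / m[rk][c]
            m[i] = [a - f * b for a, b in zip(m[i], m[rk])]
        rk += 1
        if rk == len(m):
            break
    return rk


def power_cdf(power):
    """C.d.f. t -> t**power on [0, 1], an illustrative family."""
    def cdf(t):
        t = min(max(Fraction(t), Fraction(0)), Fraction(1))
        return t ** power
    return cdf


def choose_cut_points(cdfs, candidates):
    """Keep candidates that raise the rank of the rows (F_i(u_1), ..., F_i(u_m), 1)."""
    chosen = []
    current = rank([[Fraction(1)] for _ in cdfs])    # constant column only
    for u in sorted(candidates):
        trial = chosen + [u]
        new_rank = rank([[f(t) for t in trial] + [Fraction(1)] for f in cdfs])
        if new_rank > current:
            chosen, current = trial, new_rank
        if current == len(cdfs):
            break
    return chosen, current


cdfs = [power_cdf(1), power_cdf(2), power_cdf(3)]
grid = [Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)]
cuts, achieved = choose_cut_points(cdfs, grid)
print(cuts, achieved)
```

Any point at which a null combination of the $F_i$ is nonzero necessarily raises the rank, so the greedy scan accepts exactly the kind of point the proof adds at each step.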

With this lemma, we have proved that the desired partition exists. Moreover, for any value $t \in \mathbb{R}$, we may, by increasing $\kappa$, include $t$ among the points $u_k$ without lowering the rank of the matrix. Thus we can construct partitions that involve any chosen cut point in such a fashion that Kruskal's result in the form of Corollary 2 will apply. That is, the vector $\pi$ and the matrices $M_j$ may be recovered from the mixture $P$, up to permutation of the rows. Moreover, summing up the first columns of the matrix $M_j$, up to the one corresponding to the chosen cut point $t$, we obtain the value of $F_i^j(t)$, $1 \le i \le r$, $j = 1, 2, 3$. To see that this enables one to recover the whole probability distribution up to label swapping on the indices $i$, note that once we fix an ordering on the states of the hidden variable, the rows $(F_i(u_1), \dots, F_i(u_\kappa))_{1 \le i \le r}$ are fixed and for each value of $t \in \mathbb{R}$, we associate to the $i$th row the value $F_i(t)$.
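
The column-summing step can be made concrete for a single row of a binned matrix. In this sketch the c.d.f. $F(t) = t^2$ on $[0,1]$, the cut points and the helper names are illustrative assumptions.

```python
from fractions import Fraction
from itertools import accumulate


def binned_row(cdf, cuts):
    """Row [F(u_1), F(u_2) - F(u_1), ..., 1 - F(u_last)] for consecutive intervals."""
    values = [cdf(u) for u in cuts] + [Fraction(1)]
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def cdf_at_cut(row, cuts, t):
    """Recover F(t) by summing the bins up to the cut point t (t must be a cut)."""
    return sum(row[: cuts.index(t) + 1])


def square_cdf(t):
    """Illustrative c.d.f. F(t) = t^2 on [0, 1]."""
    return Fraction(t) ** 2


cuts = [Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)]
row = binned_row(square_cdf, cuts)
cumulative = list(accumulate(row))      # the corresponding row of M'
print(row, cumulative, cdf_at_cut(row, cuts, Fraction(1, 2)))
```

The running sums reproduce the row $[F(u_1), \dots, F(u_{\kappa-1}), 1]$, which is the column operation that turns $M$ into $M'$ without changing its rank.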

To conclude the proof, in the case of more than 3 variates, we repeat the same procedure with the random variables $X_1, X_2, X_4$. This enables us to recover the values of $\{\mu_i^1, \mu_i^2, \mu_i^4\}_{1 \le i \le r}$ up to a relabeling of the groups. As soon as the $\mu_i^1$ are linearly independent, they must be different, and using the two sets $\{\mu_i^1, \mu_i^2, \mu_i^3\}_{1 \le i \le r}$ and $\{\mu_i^1, \mu_i^2, \mu_i^4\}_{1 \le i \le r}$ which are each known only up to (different) label swappings, we can thus recover the set $\{\mu_i^1, \mu_i^2, \mu_i^3, \mu_i^4\}_{1 \le i \le r}$ up to a relabeling of the groups. Adding a new random variable at a time finally gives the result. □

Proof of Theorem 9. In case of nonunidimensional blocks of independent components, we proceed much as in the proof of Theorem 8, but construct a binning into product intervals. For instance, if $X$ is two dimensional, we use $\kappa^2$ different bins, constructing $Y = (1\{X \in I^1 \times J^1\}, 1\{X \in I^1 \times J^2\}, \dots, 1\{X \in I^\kappa \times J^\kappa\})$ where $\{J^k\}_{1 \le k \le \kappa}$ is a second partition of $\mathbb{R}$ into $\kappa \in \mathbb{N}$ consecutive intervals. This yields a matrix $M$ whose rows are of the form

$$
\begin{aligned}
\bigl(&F_i(u_1,v_1),\ F_i(u_1,v_2) - F_i(u_1,v_1),\ \dots,\ F_i(u_1,+\infty) - F_i(u_1,v_{\kappa-1}), \\
&F_i(u_2,v_1) - F_i(u_1,v_1),\ F_i(u_2,v_2) - F_i(u_1,v_2) - F_i(u_2,v_1) + F_i(u_1,v_1),\ \dots, \\
&F_i(u_2,+\infty) - F_i(u_1,+\infty) - F_i(u_2,v_{\kappa-1}) + F_i(u_1,v_{\kappa-1}),\ \dots, \\
&F_i(+\infty,v_1) - F_i(u_{\kappa-1},v_1),\ F_i(+\infty,v_2) - F_i(u_{\kappa-1},v_2) - F_i(+\infty,v_1) + F_i(u_{\kappa-1},v_1),\ \dots, \\
&1 - F_i(u_{\kappa-1},+\infty) - F_i(+\infty,v_{\kappa-1}) + F_i(u_{\kappa-1},v_{\kappa-1})\bigr)
\end{aligned}
$$

for some real numbers $u_1 < u_2 < \cdots < u_{\kappa-1}$ and $v_1 < v_2 < \cdots < v_{\kappa-1}$. (To avoid cumbersome formulas, we only write the form of the matrix rows in the case $b = 2$.) This matrix has the same rank as $M'$ whose $i$th row is composed of the values $F_i(u_k, v_l)$ for $1 \le k, l \le \kappa$, using the convention $u_\kappa = v_\kappa = +\infty$. The equivalence between linear independence of the probability distributions and corresponding multidimensional c.d.f.'s remains valid.

Lemma 17 generalizes to the following.

Lemma 18. Let $\{F_i\}_{1 \le i \le r}$ be linearly independent functions on $\mathbb{R}^b$. There exists some $\kappa$, and $b$ collections of real numbers $u_1^j < u_2^j < \cdots < u_{\kappa-1}^j$, for $1 \le j \le b$, such that the $r$ row vectors composed of the values $\{F_i(u_{i_1}^1, \dots, u_{i_b}^b) \mid i_1, \dots, i_b \in \{1, \dots, \kappa\}\}$, for $1 \le i \le r$ are linearly independent.

The proof of this lemma is essentially the same as the proof of Lemma 17. The only difference with the previous setup is that now the construction of the desired set relies on addition of $b$ coordinates at a time, namely $t_1, \dots, t_b \in \mathbb{R}$, which results in adding $\sum_{j=0}^{b-1} \binom{b}{j} \kappa^j$ columns in the matrix.




To complete the argument establishing Theorem 9, we may again include any point $(t_1, \dots, t_b) \in \mathbb{R}^b$ among the $u^j_k$'s without changing the row ranks of the matrices to which we apply Kruskal's theorem. Thus we may recover the values $F_1(t_1, \dots, t_b), \dots, F_r(t_1, \dots, t_b)$, and conclude the proof in the same way as the last theorem. □

Proof of Lemma 10. Suppose the probability distribution

$$P(X_1, X_2) = \sum_{i=1}^{r} \pi_i \, \mu_i^1(X_1) \, \mu_i^2(X_2)$$

has rank $r$. Then for $k = 1, 2$, the sets $\{\mu_i^k\}_{1 \le i \le r}$ must be independent, since any dependency relation would allow $P$ to be expressed as a sum of fewer products.

Conversely, suppose for $k = 1, 2$, the measures $\{\mu_i^k\}_{1 \le i \le r}$ are independent. The corresponding sets of c.d.f.s $\{F_i^k\}_{1 \le i \le r}$ are also independent, and thus we may choose collections of points $\{t_j^k\}_{1 \le j \le r}$ such that the $r \times r$ matrices $M_k$ whose $i,j$-entries are $F_i^k(t_j^k)$ have full rank. Then with $F$ denoting the c.d.f. for $P$, the matrix $N$ with entries $F(t_i^1, t_j^2)$ can be expressed as

$$N = M_1^T \operatorname{diag}(\pi) M_2$$

and therefore has full rank. But if the rank of $P$ were less than $r$, a similar factorization arising from the expression of $P$ using fewer than $r$ summands shows that $N$ has rank smaller than $r$. Thus the rank of $P$ is at least $r$, and since the given form of $P$ shows the rank is at most $r$, it has rank exactly $r$. □
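
The factorization $N = M_1^T \operatorname{diag}(\pi) M_2$ can be verified on a toy mixture with $r = 2$. The c.d.f.s $t \mapsto t$ and $t \mapsto t^2$ on $[0,1]$, the evaluation points and the weights are assumptions chosen only so that both $M_k$ are invertible.

```python
from fractions import Fraction


def rank(rows):
    """Exact rank via Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in row] for row in rows]
    rk = 0
    for c in range(len(m[0]) if m else 0):
        pivot = next((i for i in range(rk, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[rk], m[pivot] = m[pivot], m[rk]
        for i in range(rk + 1, len(m)):
            f = m[i][c] / m[rk][c]
            m[i] = [a - f * b for a, b in zip(m[i], m[rk])]
        rk += 1
        if rk == len(m):
            break
    return rk


def power_cdf(power):
    """C.d.f. t -> t**power on [0, 1], an illustrative family."""
    def cdf(t):
        t = min(max(Fraction(t), Fraction(0)), Fraction(1))
        return t ** power
    return cdf


pi = [Fraction(1, 3), Fraction(2, 3)]
cdfs1 = [power_cdf(1), power_cdf(2)]          # F_i^1, i = 1, 2
cdfs2 = [power_cdf(1), power_cdf(2)]          # F_i^2, i = 1, 2
t1 = [Fraction(1, 2), Fraction(1)]            # points for the first coordinate
t2 = [Fraction(1, 3), Fraction(2, 3)]         # points for the second coordinate

M1 = [[f(t) for t in t1] for f in cdfs1]
M2 = [[f(t) for t in t2] for f in cdfs2]

# N from the joint c.d.f. F(s, t) = sum_i pi_i F_i^1(s) F_i^2(t)
N_direct = [[sum(w * f(s) * g(t) for w, f, g in zip(pi, cdfs1, cdfs2))
             for t in t2] for s in t1]
# N from the factorization M_1^T diag(pi) M_2
N_factored = [[sum(M1[i][a] * pi[i] * M2[i][b] for i in range(len(pi)))
               for b in range(len(t2))] for a in range(len(t1))]

print(N_direct == N_factored, rank(M1), rank(M2), rank(N_direct))
```

Since all three factors are invertible in this example, $N$ has rank 2, which matches the number of mixture components.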